ID:              0004355
Category:        [Resin]
Severity:        minor
Reproducibility: always
Date Submitted:  01-24-11 09:19
Last Update:     02-11-11 16:17
Reporter:        ferg
View Status:     public
Assigned To:     ferg
Priority:        normal
Resolution:      fixed
Status:          closed
Product Version:
Summary:         0004355: cache timeout issues
Description:

(rep by Jamison Novak) At least once a day (but usually more frequently), we run into the following error:

  [2011/01/24 10:38:56.908] com.caucho.bam.TimeoutException: com.caucho.cloud.bam.TriadFuture@5f997454 query timeout: 60000 to: cluster-cache@caa.wad.admin.resin query: CacheGet[key=e89bd3ca,version=0,server=2]
    at com.caucho.cloud.bam.TriadFuture.get(TriadFuture.java:65)
    at com.caucho.cloud.bam.TriadFirstQuery.get(TriadFirstQuery.java:179)
    at com.caucho.cloud.bam.BamTriadSender.queryTriadFirstRemote(BamTriadSender.java:642)
    at com.caucho.server.distcache.CacheMnodeActor.get(CacheMnodeActor.java:126)
    at com.caucho.server.distcache.CacheMnodeManager.get(CacheMnodeManager.java:137)
    at com.caucho.server.distcache.CacheClusterBackingImpl.loadClusterValue(CacheClusterBackingImpl.java:53)
    at com.caucho.server.distcache.AbstractCacheManager.loadExpiredValue(AbstractCacheManager.java:311)
    at com.caucho.server.distcache.AbstractCacheManager.reloadValue(AbstractCacheManager.java:271)
    at com.caucho.server.distcache.AbstractCacheManager.getMnodeValue(AbstractCacheManager.java:247)
    at com.caucho.server.distcache.AbstractCacheManager.get(AbstractCacheManager.java:168)
    at com.caucho.server.distcache.ProCacheEntry.get(ProCacheEntry.java:81)
    at com.caucho.distcache.AbstractCache.get(AbstractCache.java:465)
    at com.caucho.server.repository.ClusterRepository.checkForUpdate(ClusterRepository.java:144)
    at com.caucho.server.repository.ClusterRepository.start(ClusterRepository.java:110)
    at com.caucho.env.repository.RepositoryService.start(RepositoryService.java:114)
    at com.caucho.env.service.ResinSystem.startServices(ResinSystem.java:480)
    at com.caucho.env.service.ResinSystem.start(ResinSystem.java:452)
    at com.caucho.server.resin.Resin.start(Resin.java:889)
    at com.caucho.server.resin.Resin.initMain(Resin.java:1017)
    at com.caucho.server.resin.Resin.main(Resin.java:1284)

At this point, about the only thing we can do is bounce the instance and hope it comes back up cleanly.

This morning, we ran into that problem again. I bounced the instance on all 8 of our webservers and 7 of them came up cleanly. One of them did not. The watchdog log is showing this:

  [2011/01/24 10:48:41.656] {watchdog-wad-c} WatchdogChild[wad-c] starting
  [2011/01/24 10:48:41.657] {watchdog-wad-c} Watchdog starting Resin[wad-c]
  [2011/01/24 10:49:51.120] {watchdog-wad-c} Watchdog detected close of Resin[wad-c,pid=16232] exit reason: BAD_CONFIG (exit code=2)
  [2011/01/24 10:49:51.120] {watchdog-wad-c} WatchdogChild[wad-c] starting
  [2011/01/24 10:49:51.121] {watchdog-wad-c} Watchdog starting Resin[wad-c]

That just repeats: the watchdog tries to bring the instance up, decides for whatever reason that it has a BAD_CONFIG, then restarts it again. One thing to note: all 8 servers are configured identically from an operating system and Resin perspective, and the resin.xml file is not bad. When I turned logging on this server up to "finer", I got roughly 1 MB/sec of log entries relating to "unknown actor" and cluster-cache activity.
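(For reference, the log-level bump described above goes through Resin's standard logging configuration. The snippet below is only an illustrative sketch assuming a stock Resin 4.x resin.xml; the handler details and element placement in our actual file may differ.)

  <resin xmlns="http://caucho.com/ns/resin">
    <!-- sketch: route server output to the JVM log and raise com.caucho classes to "finer" -->
    <log-handler name="" level="finer" path="stdout:"/>
    <logger name="com.caucho" level="finer"/>
  </resin>

Because name="com.caucho" covers the whole server, this also turns up the cluster-cache and actor logging responsible for the volume described above.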
Here's a sample of those log entries:

  [2011/01/24 10:45:05.726] TcpSocketLink[id=server://172.27.240.13:6850-13,server://172.27.240.13:6850,INIT] starting listen thread
  [2011/01/24 10:45:05.726] TcpSocketLink[id=13,wad-c] starting connection TcpSocketLink[id=server://172.27.240.13:6850-13,server://172.27.240.13:6850,ACCEPT], total=0
  [2011/01/24 10:45:05.726] TcpSocketLink[id=server://172.27.240.13:6850-13,server://172.27.240.13:6850,ACCEPT] accept
  [2011/01/24 10:45:05.726] Hmux[wad-c:13] start request
  [2011/01/24 10:45:05.726] Hmux[wad-c:13] 7-r switch-to-hmtp
  [2011/01/24 10:45:05.727] TcpSocketLink[id=server://172.27.240.13:6850-14,server://172.27.240.13:6850,INIT] starting listen thread
  [2011/01/24 10:45:05.733] HempBroker[caa.wad.admin.resin]: message to unknown actor to: cluster-cache-data@caa.wad.admin.resin from: cluster-cache-data@aaa.wad.admin.resin payload: DataPut[value=62589522]
  [2011/01/24 10:45:05.735] HeartbeatActor[cluster-heartbeat@caa.wad.admin.resin] HeartbeatMessage[7] {to:cluster-heartbeat@caa.wad.admin.resin,from:cluster-heartbeat@haa.wad.admin.resin}
  [2011/01/24 10:45:05.739] HempBroker[caa.wad.admin.resin]: message to unknown actor to: cluster-cache-data@caa.wad.admin.resin from: cluster-cache-data@baa.wad.admin.resin payload: DataPut[value=3e421d4c]
  [2011/01/24 10:45:05.741] HempBroker[caa.wad.admin.resin]: query(16292) to unknown actor to:cluster-cache-data@caa.wad.admin.resin from:cluster-cache-data@baa.wad.admin.resin payload:DataPutTriad[value=62589522]

We eventually get to a point where the following sample is all we see repeating over and over in the log, and the appserver never actually logs that it started:

  [2011/01/24 10:56:47.362] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] DataPutTriad[value=8d07bf29] cluster-cache-data@haa.wad.admin.resin DataPutTriad[value=8d07bf29]
  [2011/01/24 10:56:47.363] DataStore[data] updateExpires HashKey[8d07bf29]
  [2011/01/24 10:56:47.363] DataStore[data] insert HashKey[8d07bf29] length:-1
  [2011/01/24 10:56:47.363] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] message ignored DataPut[value=8d07bf29] {from:cluster-cache-data@aaa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
  [2011/01/24 10:56:47.363] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] message ignored DataPut[value=8d07bf29] {from:cluster-cache-data@baa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
  [2011/01/24 10:56:47.363] cluster-cache@caa put from=cluster-cache@baa.wad.admin.resin CachePut[key=0b436e6e,value=8d07bf29,flags=6,version=13,lease=7,source=null]
  [2011/01/24 10:56:47.363] MnodeStore[wad-c] updateSave key=HashKey[0b436e6e] value=HashKey[8d07bf29]
  [2011/01/24 10:56:47.364] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] messageError ignored ActorError[type=cancel,group=feature-not-implemented,text=CacheDataTriadActor[cluster-cache-data@aaa.wad.admin.resin]: message is not implemented for this payload DataPut[value=8d07bf29] {from:cluster-cache-data@caa.wad.admin.resin, to:cluster-cache-data@aaa.wad.admin.resin}] cluster-cache-data@aaa.wad.admin.resin {from:cluster-cache-data@aaa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
  [2011/01/24 10:56:47.364] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] messageError ignored ActorError[type=cancel,group=feature-not-implemented,text=CacheDataTriadActor[cluster-cache-data@baa.wad.admin.resin]: message is not implemented for this payload DataPut[value=8d07bf29] {from:cluster-cache-data@caa.wad.admin.resin, to:cluster-cache-data@baa.wad.admin.resin}] cluster-cache-data@baa.wad.admin.resin {from:cluster-cache-data@baa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
  [2011/01/24 10:56:47.448] com.caucho.bam.TimeoutException: com.caucho.cloud.bam.TriadFuture@31f32e51 query timeout: 60000 to: cluster-cache@caa.wad.admin.resin query: CacheGet[key=e89bd3ca,version=0,server=2]

Do you have any ideas as to what may be causing this? We currently have always-save-session and always-load-session enabled while we troubleshoot session issues. Is it possible that we're generating more data than Resin can reasonably handle? (Seems unlikely, but...)
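(For reference, the two session flags mentioned above are per-web-app settings. The snippet below is only an illustrative sketch of how they are typically enabled in a resin-web.xml, assuming simple boolean values; it is not a copy of our actual descriptor.)

  <web-app xmlns="http://caucho.com/ns/resin">
    <session-config>
      <!-- sketch: write the session back on every request and reload it on every request -->
      <always-save-session>true</always-save-session>
      <always-load-session>true</always-load-session>
    </session-config>
  </web-app>

With both enabled, every request both stores and reloads the session through the distributed cache, which is the extra cluster traffic the question above refers to.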
There are no notes attached to this issue.