Mantis - Resin
Viewing Issue Advanced Details
ID:               4355
Severity:         minor
Reproducibility:  always
Date Submitted:   01-24-11 09:19
Last Update:      02-11-11 16:17
Reporter:         ferg
Assigned To:      ferg
Priority:         normal
Status:           closed
Resolution:       fixed
Product Version:  none
Target Version:   none
Fixed in Version: 4.0.15
0004355: cache timeout issues
(reported by Jamison Novak)

At least once a day (but usually more frequently), we run into the following error:

[2011/01/24 10:38:56.908] com.caucho.bam.TimeoutException: com.caucho.cloud.bam.TriadFuture@5f997454 query timeout: 60000
                            to: cluster-cache@caa.wad.admin.resin
                            query: CacheGet[key=e89bd3ca,version=0,server=2]
                                at com.caucho.cloud.bam.TriadFuture.get(TriadFuture.java:65)
                                at com.caucho.cloud.bam.TriadFirstQuery.get(TriadFirstQuery.java:179)
                                at com.caucho.cloud.bam.BamTriadSender.queryTriadFirstRemote(BamTriadSender.java:642)
                                at com.caucho.server.distcache.CacheMnodeActor.get(CacheMnodeActor.java:126)
                                at com.caucho.server.distcache.CacheMnodeManager.get(CacheMnodeManager.java:137)
                                at com.caucho.server.distcache.CacheClusterBackingImpl.loadClusterValue(CacheClusterBackingImpl.java:53)
                                at com.caucho.server.distcache.AbstractCacheManager.loadExpiredValue(AbstractCacheManager.java:311)
                                at com.caucho.server.distcache.AbstractCacheManager.reloadValue(AbstractCacheManager.java:271)
                                at com.caucho.server.distcache.AbstractCacheManager.getMnodeValue(AbstractCacheManager.java:247)
                                at com.caucho.server.distcache.AbstractCacheManager.get(AbstractCacheManager.java:168)
                                at com.caucho.server.distcache.ProCacheEntry.get(ProCacheEntry.java:81)
                                at com.caucho.distcache.AbstractCache.get(AbstractCache.java:465)
                                at com.caucho.server.repository.ClusterRepository.checkForUpdate(ClusterRepository.java:144)
                                at com.caucho.server.repository.ClusterRepository.start(ClusterRepository.java:110)
                                at com.caucho.env.repository.RepositoryService.start(RepositoryService.java:114)
                                at com.caucho.env.service.ResinSystem.startServices(ResinSystem.java:480)
                                at com.caucho.env.service.ResinSystem.start(ResinSystem.java:452)
                                at com.caucho.server.resin.Resin.start(Resin.java:889)
                                at com.caucho.server.resin.Resin.initMain(Resin.java:1017)
                                at com.caucho.server.resin.Resin.main(Resin.java:1284)


At this point, about the only thing we can do is bounce the instance and hope it comes back up cleanly.

This morning, we ran into that problem again. I bounced the instance on all 8 of our webservers and 7 of them came up cleanly. One of them did not.

The watchdog log is showing this:

[2011/01/24 10:48:41.656] {watchdog-wad-c} WatchdogChild[wad-c] starting
[2011/01/24 10:48:41.657] {watchdog-wad-c} Watchdog starting Resin[wad-c]
[2011/01/24 10:49:51.120] {watchdog-wad-c} Watchdog detected close of Resin[wad-c,pid=16232]
                                             exit reason: BAD_CONFIG (exit code=2)
[2011/01/24 10:49:51.120] {watchdog-wad-c} WatchdogChild[wad-c] starting
[2011/01/24 10:49:51.121] {watchdog-wad-c} Watchdog starting Resin[wad-c]

That just repeats: the watchdog tries to bring the instance up, decides for whatever reason that it has a BAD_CONFIG, and restarts it again. One thing to note: all 8 servers are configured identically from an operating system and Resin perspective, so the resin.xml file itself is not bad.
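
For context, the cluster definition in resin.xml is nothing unusual. The sketch below is approximate -- only wad-c's address, 172.27.240.13:6850, comes from the logs; the other ids and addresses are placeholders:

<cluster id="wad">
  <!-- 8 identical app servers; only wad-c's address is taken from the log,
       the rest are illustrative placeholders -->
  <server id="wad-a" address="172.27.240.11" port="6850"/>
  <server id="wad-b" address="172.27.240.12" port="6850"/>
  <server id="wad-c" address="172.27.240.13" port="6850"/>
  <!-- ... wad-d through wad-h ... -->
</cluster>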

When I turned logging on this server up to "finer", I got roughly 1 MB/sec of log entries relating to "unknown actor" and cluster-cache traffic. Here's a sample:


[2011/01/24 10:45:05.726] TcpSocketLink[id=server://172.27.240.13:6850-13,server://172.27.240.13:6850,INIT] starting listen thread
[2011/01/24 10:45:05.726] TcpSocketLink[id=13,wad-c] starting connection TcpSocketLink[id=server://172.27.240.13:6850-13,server://172.27.240.13:6850,ACCEPT], total=0
[2011/01/24 10:45:05.726] TcpSocketLink[id=server://172.27.240.13:6850-13,server://172.27.240.13:6850,ACCEPT] accept
[2011/01/24 10:45:05.726] Hmux[wad-c:13] start request
[2011/01/24 10:45:05.726] Hmux[wad-c:13] 7-r switch-to-hmtp
[2011/01/24 10:45:05.727] TcpSocketLink[id=server://172.27.240.13:6850-14,server://172.27.240.13:6850,INIT] [^] starting listen thread
[2011/01/24 10:45:05.733] HempBroker[caa.wad.admin.resin]: message to unknown actor
                            to: cluster-cache-data@caa.wad.admin.resin
                            from: cluster-cache-data@aaa.wad.admin.resin
                            payload: DataPut[value=62589522]
[2011/01/24 10:45:05.735] HeartbeatActor[cluster-heartbeat@caa.wad.admin.resin] HeartbeatMessage[7] {to:cluster-heartbeat@caa.wad.admin.resin,from:cluster-heartbeat@haa.wad.admin.resin}
[2011/01/24 10:45:05.739] HempBroker[caa.wad.admin.resin]: message to unknown actor
                            to: cluster-cache-data@caa.wad.admin.resin
                            from: cluster-cache-data@baa.wad.admin.resin
                            payload: DataPut[value=3e421d4c]
[2011/01/24 10:45:05.741] HempBroker[caa.wad.admin.resin]: query(16292) to unknown actor to:cluster-cache-data@caa.wad.admin.resin
                            from:cluster-cache-data@baa.wad.admin.resin
                            payload:DataPutTriad[value=62589522]
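
For reference, the "finer" output above comes from an ordinary logger override in resin.xml, something like the following (approximate; the stock log-handler setup is unchanged and omitted here):

<!-- raise com.caucho logging from the default "info" to "finer" -->
<logger name="com.caucho" level="finer"/>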


We eventually get to a point where the following sample is all we see, repeating over and over in the log, and the app server never actually logs that it started:


[2011/01/24 10:56:47.362] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] DataPutTriad[value=8d07bf29] cluster-cache-data@haa.wad.admin.resin DataPutTriad[value=8d07bf29]
[2011/01/24 10:56:47.363] DataStore[data] updateExpires HashKey[8d07bf29]
[2011/01/24 10:56:47.363] DataStore[data] insert HashKey[8d07bf29] length:-1
[2011/01/24 10:56:47.363] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] message ignored DataPut[value=8d07bf29] {from:cluster-cache-data@aaa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
[2011/01/24 10:56:47.363] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] message ignored DataPut[value=8d07bf29] {from:cluster-cache-data@baa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
[2011/01/24 10:56:47.363] cluster-cache@caa put from=cluster-cache@baa.wad.admin.resin CachePut[key=0b436e6e,value=8d07bf29,flags=6,version=13,lease=7,source=null]
[2011/01/24 10:56:47.363] MnodeStore[wad-c] updateSave key=HashKey[0b436e6e] value=HashKey[8d07bf29]
[2011/01/24 10:56:47.364] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] messageError ignored ActorError[type=cancel,group=feature-not-implemented,text=CacheDataTriadActor[cluster-cache-data@aaa.wad.admin.resin]: message is not implemented for this payload
                            DataPut[value=8d07bf29]
                            {from:cluster-cache-data@caa.wad.admin.resin, to:cluster-cache-data@aaa.wad.admin.resin}] cluster-cache-data@aaa.wad.admin.resin {from:cluster-cache-data@aaa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
[2011/01/24 10:56:47.364] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] messageError ignored ActorError[type=cancel,group=feature-not-implemented,text=CacheDataTriadActor[cluster-cache-data@baa.wad.admin.resin]: message is not implemented for this payload
                            DataPut[value=8d07bf29]
                            {from:cluster-cache-data@caa.wad.admin.resin, to:cluster-cache-data@baa.wad.admin.resin}] cluster-cache-data@baa.wad.admin.resin {from:cluster-cache-data@baa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
[2011/01/24 10:56:47.448] com.caucho.bam.TimeoutException: com.caucho.cloud.bam.TriadFuture@31f32e51 query timeout: 60000
                            to: cluster-cache@caa.wad.admin.resin
                            query: CacheGet[key=e89bd3ca,version=0,server=2]



Do you have any ideas as to what may be causing this? We currently have always-save-session and always-load-session enabled while we troubleshoot session issues. Is it possible that we're generating more data than Resin can reasonably handle? (Seems unlikely, but...)
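
For completeness, those two session flags are set the usual way in our resin-web.xml, roughly like this (exact placement approximate):

<session-config>
  <!-- both enabled while we troubleshoot session issues -->
  <always-save-session>true</always-save-session>
  <always-load-session>true</always-load-session>
</session-config>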


There are no notes attached to this issue.