ID:              0004355
Category:        [Resin]
Severity:        minor
Reproducibility: always
Date Submitted:  01-24-11 09:19
Last Update:     02-11-11 16:17
Reporter:        ferg
View Status:     public
Assigned To:     ferg
Priority:        normal
Resolution:      fixed
Status:          closed
Product Version:
Summary:         0004355: cache timeout issues
Description:

(rep by Jamison Novak) At least once a day (but usually more frequently), we run into the following error:

  [2011/01/24 10:38:56.908] com.caucho.bam.TimeoutException: com.caucho.cloud.bam.TriadFuture@5f997454 query timeout: 60000 to: cluster-cache@caa.wad.admin.resin query: CacheGet[key=e89bd3ca,version=0,server=2]
    at com.caucho.cloud.bam.TriadFuture.get(TriadFuture.java:65)
    at com.caucho.cloud.bam.TriadFirstQuery.get(TriadFirstQuery.java:179)
    at com.caucho.cloud.bam.BamTriadSender.queryTriadFirstRemote(BamTriadSender.java:642)
    at com.caucho.server.distcache.CacheMnodeActor.get(CacheMnodeActor.java:126)
    at com.caucho.server.distcache.CacheMnodeManager.get(CacheMnodeManager.java:137)
    at com.caucho.server.distcache.CacheClusterBackingImpl.loadClusterValue(CacheClusterBackingImpl.java:53)
    at com.caucho.server.distcache.AbstractCacheManager.loadExpiredValue(AbstractCacheManager.java:311)
    at com.caucho.server.distcache.AbstractCacheManager.reloadValue(AbstractCacheManager.java:271)
    at com.caucho.server.distcache.AbstractCacheManager.getMnodeValue(AbstractCacheManager.java:247)
    at com.caucho.server.distcache.AbstractCacheManager.get(AbstractCacheManager.java:168)
    at com.caucho.server.distcache.ProCacheEntry.get(ProCacheEntry.java:81)
    at com.caucho.distcache.AbstractCache.get(AbstractCache.java:465)
    at com.caucho.server.repository.ClusterRepository.checkForUpdate(ClusterRepository.java:144)
    at com.caucho.server.repository.ClusterRepository.start(ClusterRepository.java:110)
    at com.caucho.env.repository.RepositoryService.start(RepositoryService.java:114)
    at com.caucho.env.service.ResinSystem.startServices(ResinSystem.java:480)
    at com.caucho.env.service.ResinSystem.start(ResinSystem.java:452)
    at com.caucho.server.resin.Resin.start(Resin.java:889)
    at com.caucho.server.resin.Resin.initMain(Resin.java:1017)
    at com.caucho.server.resin.Resin.main(Resin.java:1284)

At this point, about the only thing we can do is bounce the instance and hope it comes back up cleanly.

This morning, we ran into that problem again. I bounced the instance on all 8 of our webservers and 7 of them came up cleanly. One of them did not. The watchdog log is showing this:

  [2011/01/24 10:48:41.656] {watchdog-wad-c} WatchdogChild[wad-c] starting
  [2011/01/24 10:48:41.657] {watchdog-wad-c} Watchdog starting Resin[wad-c]
  [2011/01/24 10:49:51.120] {watchdog-wad-c} Watchdog detected close of Resin[wad-c,pid=16232] exit reason: BAD_CONFIG (exit code=2)
  [2011/01/24 10:49:51.120] {watchdog-wad-c} WatchdogChild[wad-c] starting
  [2011/01/24 10:49:51.121] {watchdog-wad-c} Watchdog starting Resin[wad-c]

That just repeats: the watchdog tries to bring the instance up, decides for whatever reason that it has a BAD_CONFIG, then restarts it again. One thing to note: all 8 servers are configured identically from an operating system and Resin perspective, and the resin.xml file is not bad. When I turned logging on this server up to "finer", I got roughly 1 MB/sec of log entries relating to "unknown actor" and cluster-cache activity.
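(For reference, the log-level bump described above goes through Resin's standard logging configuration. The snippet below is only an illustrative sketch assuming a stock Resin 4.x resin.xml; the handler details and element placement in our actual file may differ.)

  <resin xmlns="http://caucho.com/ns/resin">
    <!-- sketch: route server output to the JVM log and raise com.caucho classes to "finer" -->
    <log-handler name="" level="finer" path="stdout:"/>
    <logger name="com.caucho" level="finer"/>
  </resin>

Because name="com.caucho" covers the whole server, this also turns up the cluster-cache and actor logging responsible for the volume described above.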
Here's a sample of those log entries:

  [2011/01/24 10:45:05.726] TcpSocketLink[id=server://172.27.240.13:6850-13,server://172.27.240.13:6850,INIT] starting listen thread
  [2011/01/24 10:45:05.726] TcpSocketLink[id=13,wad-c] starting connection TcpSocketLink[id=server://172.27.240.13:6850-13,server://172.27.240.13:6850,ACCEPT], total=0
  [2011/01/24 10:45:05.726] TcpSocketLink[id=server://172.27.240.13:6850-13,server://172.27.240.13:6850,ACCEPT] accept
  [2011/01/24 10:45:05.726] Hmux[wad-c:13] start request
  [2011/01/24 10:45:05.726] Hmux[wad-c:13] 7-r switch-to-hmtp
  [2011/01/24 10:45:05.727] TcpSocketLink[id=server://172.27.240.13:6850-14,server://172.27.240.13:6850,INIT] starting listen thread
  [2011/01/24 10:45:05.733] HempBroker[caa.wad.admin.resin]: message to unknown actor to: cluster-cache-data@caa.wad.admin.resin from: cluster-cache-data@aaa.wad.admin.resin payload: DataPut[value=62589522]
  [2011/01/24 10:45:05.735] HeartbeatActor[cluster-heartbeat@caa.wad.admin.resin] HeartbeatMessage[7] {to:cluster-heartbeat@caa.wad.admin.resin,from:cluster-heartbeat@haa.wad.admin.resin}
  [2011/01/24 10:45:05.739] HempBroker[caa.wad.admin.resin]: message to unknown actor to: cluster-cache-data@caa.wad.admin.resin from: cluster-cache-data@baa.wad.admin.resin payload: DataPut[value=3e421d4c]
  [2011/01/24 10:45:05.741] HempBroker[caa.wad.admin.resin]: query(16292) to unknown actor to:cluster-cache-data@caa.wad.admin.resin from:cluster-cache-data@baa.wad.admin.resin payload:DataPutTriad[value=62589522]

We eventually get to a point where the following sample is all we see repeating over and over in the log, and the appserver never actually logs that it started:

  [2011/01/24 10:56:47.362] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] DataPutTriad[value=8d07bf29] cluster-cache-data@haa.wad.admin.resin DataPutTriad[value=8d07bf29]
  [2011/01/24 10:56:47.363] DataStore[data] updateExpires HashKey[8d07bf29]
  [2011/01/24 10:56:47.363] DataStore[data] insert HashKey[8d07bf29] length:-1
  [2011/01/24 10:56:47.363] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] message ignored DataPut[value=8d07bf29] {from:cluster-cache-data@aaa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
  [2011/01/24 10:56:47.363] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] message ignored DataPut[value=8d07bf29] {from:cluster-cache-data@baa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
  [2011/01/24 10:56:47.363] cluster-cache@caa put from=cluster-cache@baa.wad.admin.resin CachePut[key=0b436e6e,value=8d07bf29,flags=6,version=13,lease=7,source=null]
  [2011/01/24 10:56:47.363] MnodeStore[wad-c] updateSave key=HashKey[0b436e6e] value=HashKey[8d07bf29]
  [2011/01/24 10:56:47.364] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] messageError ignored ActorError[type=cancel,group=feature-not-implemented,text=CacheDataTriadActor[cluster-cache-data@aaa.wad.admin.resin]: message is not implemented for this payload DataPut[value=8d07bf29] {from:cluster-cache-data@caa.wad.admin.resin, to:cluster-cache-data@aaa.wad.admin.resin}] cluster-cache-data@aaa.wad.admin.resin {from:cluster-cache-data@aaa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
  [2011/01/24 10:56:47.364] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] messageError ignored ActorError[type=cancel,group=feature-not-implemented,text=CacheDataTriadActor[cluster-cache-data@baa.wad.admin.resin]: message is not implemented for this payload DataPut[value=8d07bf29] {from:cluster-cache-data@caa.wad.admin.resin, to:cluster-cache-data@baa.wad.admin.resin}] cluster-cache-data@baa.wad.admin.resin {from:cluster-cache-data@baa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
  [2011/01/24 10:56:47.448] com.caucho.bam.TimeoutException: com.caucho.cloud.bam.TriadFuture@31f32e51 query timeout: 60000 to: cluster-cache@caa.wad.admin.resin query: CacheGet[key=e89bd3ca,version=0,server=2]

Do you have any ideas as to what may be causing this? We currently have always-save-session and always-load-session enabled while we troubleshoot session issues. Is it possible that we're generating more data than Resin can reasonably handle? (Seems unlikely, but...)
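(For reference, the two session flags mentioned above are per-web-app settings. The snippet below is only an illustrative sketch of how they are typically enabled in a resin-web.xml, assuming simple boolean values; it is not a copy of our actual descriptor.)

  <web-app xmlns="http://caucho.com/ns/resin">
    <session-config>
      <!-- sketch: write the session back on every request and reload it on every request -->
      <always-save-session>true</always-save-session>
      <always-load-session>true</always-load-session>
    </session-config>
  </web-app>

With both enabled, every request both stores and reloads the session through the distributed cache, which is the extra cluster traffic the question above refers to.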
There are no notes attached to this issue.