Mantis Bugtracker
  

Viewing Issue Simple Details Jump to Notes ] View Advanced ] Issue History ] Print ]
ID Category Severity Reproducibility Date Submitted Last Update
0004355 [Resin] minor always 01-24-11 09:19 02-11-11 16:17
Reporter ferg View Status public  
Assigned To ferg
Priority normal Resolution fixed  
Status closed   Product Version
Summary 0004355: cache timeout issues
Description (rep by Jamison Novak)

At least once a day (but usually more frequently), we run into the following error:

[2011/01/24 10:38:56.908] com.caucho.bam.TimeoutException: com.caucho.cloud.bam.TriadFuture@5f997454 query timeout: 60000
                            to: cluster-cache@caa.wad.admin.resin
                            query: CacheGet[key=e89bd3ca,version=0,server=2]
                                at com.caucho.cloud.bam.TriadFuture.get(TriadFuture.java:65)
                                at com.caucho.cloud.bam.TriadFirstQuery.get(TriadFirstQuery.java:179)
                                at com.caucho.cloud.bam.BamTriadSender.queryTriadFirstRemote(BamTriadSender.java:642)
                                at com.caucho.server.distcache.CacheMnodeActor.get(CacheMnodeActor.java:126)
                                at com.caucho.server.distcache.CacheMnodeManager.get(CacheMnodeManager.java:137)
                                at com.caucho.server.distcache.CacheClusterBackingImpl.loadClusterValue(CacheClusterBackingImpl.java:53)
                                at com.caucho.server.distcache.AbstractCacheManager.loadExpiredValue(AbstractCacheManager.java:311)
                                at com.caucho.server.distcache.AbstractCacheManager.reloadValue(AbstractCacheManager.java:271)
                                at com.caucho.server.distcache.AbstractCacheManager.getMnodeValue(AbstractCacheManager.java:247)
                                at com.caucho.server.distcache.AbstractCacheManager.get(AbstractCacheManager.java:168)
                                at com.caucho.server.distcache.ProCacheEntry.get(ProCacheEntry.java:81)
                                at com.caucho.distcache.AbstractCache.get(AbstractCache.java:465)
                                at com.caucho.server.repository.ClusterRepository.checkForUpdate(ClusterRepository.java:144)
                                at com.caucho.server.repository.ClusterRepository.start(ClusterRepository.java:110)
                                at com.caucho.env.repository.RepositoryService.start(RepositoryService.java:114)
                                at com.caucho.env.service.ResinSystem.startServices(ResinSystem.java:480)
                                at com.caucho.env.service.ResinSystem.start(ResinSystem.java:452)
                                at com.caucho.server.resin.Resin.start(Resin.java:889)
                                at com.caucho.server.resin.Resin.initMain(Resin.java:1017)
                                at com.caucho.server.resin.Resin.main(Resin.java:1284)


At this point, about the only thing we can do is bounce the instance and hope it comes back up cleanly.

This morning, we ran into that problem again. I bounced the instance on all 8 of our webservers and 7 of them came up cleanly. One of them did not.

The watchdog log is showing this:

[2011/01/24 10:48:41.656] {watchdog-wad-c} WatchdogChild[wad-c] starting
[2011/01/24 10:48:41.657] {watchdog-wad-c} Watchdog starting Resin[wad-c]
[2011/01/24 10:49:51.120] {watchdog-wad-c} Watchdog detected close of Resin[wad-c,pid=16232]
                                             exit reason: BAD_CONFIG (exit code=2)
[2011/01/24 10:49:51.120] {watchdog-wad-c} WatchdogChild[wad-c] starting
[2011/01/24 10:49:51.121] {watchdog-wad-c} Watchdog starting Resin[wad-c]

That just repeats as it tries to bring the instance up, decides for whatever reason that it has a BAD_CONFIG, then restarts it again. One thing to note: all 8 servers are configured identically from an operating system and Resin perspective. The resin.xml file is not bad.

When I turned logging on this server up to "finer", I am getting roughly 1MB/sec. of log entries relating to "unknown actor" and cluster-cache things. Here's a sample:


[2011/01/24 10:45:05.726] TcpSocketLink[id=server://172.27.240.13:6850-13,server://172.27.240.13:6850,INIT] [^] starting listen thread
[2011/01/24 10:45:05.726] TcpSocketLink[id=13,wad-c] starting connection TcpSocketLink[id=server://172.27.240.13:6850-13,server://172.27.240.13:6850,ACCEPT], [^] total=0
[2011/01/24 10:45:05.726] TcpSocketLink[id=server://172.27.240.13:6850-13,server://172.27.240.13:6850,ACCEPT] [^] accept
[2011/01/24 10:45:05.726] Hmux[wad-c:13] start request
[2011/01/24 10:45:05.726] Hmux[wad-c:13] 7-r switch-to-hmtp
[2011/01/24 10:45:05.727] TcpSocketLink[id=server://172.27.240.13:6850-14,server://172.27.240.13:6850,INIT] [^] starting listen thread
[2011/01/24 10:45:05.733] HempBroker[caa.wad.admin.resin]: message to unknown actor
                            to: cluster-cache-data@caa.wad.admin.resin
                            from: cluster-cache-data@aaa.wad.admin.resin
                            payload: DataPut[value=62589522]
[2011/01/24 10:45:05.735] HeartbeatActor[cluster-heartbeat@caa.wad.admin.resin] HeartbeatMessage[7] {to:cluster-heartbeat@caa.wad.admin.resin,from:cluster-heartbeat@haa.wad.admin.resin}
[2011/01/24 10:45:05.739] HempBroker[caa.wad.admin.resin]: message to unknown actor
                            to: cluster-cache-data@caa.wad.admin.resin
                            from: cluster-cache-data@baa.wad.admin.resin
                            payload: DataPut[value=3e421d4c]
[2011/01/24 10:45:05.741] HempBroker[caa.wad.admin.resin]: query(16292) to unknown actor to:cluster-cache-data@caa.wad.admin.resin
                            from:cluster-cache-data@baa.wad.admin.resin
                            payload:DataPutTriad[value=62589522]


We eventually get to a point where the following sample is all we see repeating over and over in the log and the appserver never actually logs that it started:


[2011/01/24 10:56:47.362] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] DataPutTriad[value=8d07bf29] cluster-cache-data@haa.wad.admin.resin DataPutTriad[value=8d07bf29]
[2011/01/24 10:56:47.363] DataStore[data] updateExpires HashKey[8d07bf29]
[2011/01/24 10:56:47.363] DataStore[data] insert HashKey[8d07bf29] length:-1
[2011/01/24 10:56:47.363] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] message ignored DataPut[value=8d07bf29] {from:cluster-cache-data@aaa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
[2011/01/24 10:56:47.363] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] message ignored DataPut[value=8d07bf29] {from:cluster-cache-data@baa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
[2011/01/24 10:56:47.363] cluster-cache@caa put from=cluster-cache@baa.wad.admin.resin CachePut[key=0b436e6e,value=8d07bf29,flags=6,version=13,lease=7,source=null]
[2011/01/24 10:56:47.363] MnodeStore[wad-c] updateSave key=HashKey[0b436e6e] value=HashKey[8d07bf29]
[2011/01/24 10:56:47.364] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] messageError ignored ActorError[type=cancel,group=feature-not-implemented,text=CacheDataTriadActor[cluster-cache-data@aaa.wad.admin.resin]: message is not implemented for this payload
                            DataPut[value=8d07bf29]
                            {from:cluster-cache-data@caa.wad.admin.resin, to:cluster-cache-data@aaa.wad.admin.resin}] cluster-cache-data@aaa.wad.admin.resin {from:cluster-cache-data@aaa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
[2011/01/24 10:56:47.364] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] messageError ignored ActorError[type=cancel,group=feature-not-implemented,text=CacheDataTriadActor[cluster-cache-data@baa.wad.admin.resin]: message is not implemented for this payload
                            DataPut[value=8d07bf29]
                            {from:cluster-cache-data@caa.wad.admin.resin, to:cluster-cache-data@baa.wad.admin.resin}] cluster-cache-data@baa.wad.admin.resin {from:cluster-cache-data@baa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
[2011/01/24 10:56:47.448] com.caucho.bam.TimeoutException: com.caucho.cloud.bam.TriadFuture@31f32e51 query timeout: 60000
                            to: cluster-cache@caa.wad.admin.resin
                            query: CacheGet[key=e89bd3ca,version=0,server=2]



Do you have any ideas as to what may be causing this? We currently have always-save-session and always-load-session enabled while we troubleshoot session issues. Is it possible that we're generating more data than Resin can reasonably handle? (Seems unlikely, but...)

Additional Information
Attached Files

- Relationships

There are no notes attached to this issue.

- Issue History
Date Modified Username Field Change
01-24-11 09:19 ferg New Issue
01-24-11 09:35 jnovak Issue Monitored: jnovak
02-11-11 16:17 ferg Assigned To  => ferg
02-11-11 16:17 ferg Status new => closed
02-11-11 16:17 ferg Resolution open => fixed
02-11-11 16:17 ferg Fixed in Version  => 4.0.15


Mantis 1.0.0rc3[^]
Copyright © 2000 - 2005 Mantis Group
28 total queries executed.
25 unique queries executed.
Powered by Mantis Bugtracker