Mantis Bugtracker
  

Viewing Issue Advanced Details
ID: 0004355
Category: [Resin]
Severity: minor
Reproducibility: always
Date Submitted: 01-24-11 09:19
Last Update: 02-11-11 16:17
Reporter: ferg
View Status: public
Assigned To: ferg
Priority: normal
Status: closed
Resolution: fixed
Projection: none
ETA: none
Fixed in Version: 4.0.15
Summary: 0004355: cache timeout issues
Description: (reported by Jamison Novak)

At least once a day (but usually more frequently), we run into the following error:

[2011/01/24 10:38:56.908] com.caucho.bam.TimeoutException: com.caucho.cloud.bam.TriadFuture@5f997454 query timeout: 60000
                            to: cluster-cache@caa.wad.admin.resin
                            query: CacheGet[key=e89bd3ca,version=0,server=2]
                                at com.caucho.cloud.bam.TriadFuture.get(TriadFuture.java:65)
                                at com.caucho.cloud.bam.TriadFirstQuery.get(TriadFirstQuery.java:179)
                                at com.caucho.cloud.bam.BamTriadSender.queryTriadFirstRemote(BamTriadSender.java:642)
                                at com.caucho.server.distcache.CacheMnodeActor.get(CacheMnodeActor.java:126)
                                at com.caucho.server.distcache.CacheMnodeManager.get(CacheMnodeManager.java:137)
                                at com.caucho.server.distcache.CacheClusterBackingImpl.loadClusterValue(CacheClusterBackingImpl.java:53)
                                at com.caucho.server.distcache.AbstractCacheManager.loadExpiredValue(AbstractCacheManager.java:311)
                                at com.caucho.server.distcache.AbstractCacheManager.reloadValue(AbstractCacheManager.java:271)
                                at com.caucho.server.distcache.AbstractCacheManager.getMnodeValue(AbstractCacheManager.java:247)
                                at com.caucho.server.distcache.AbstractCacheManager.get(AbstractCacheManager.java:168)
                                at com.caucho.server.distcache.ProCacheEntry.get(ProCacheEntry.java:81)
                                at com.caucho.distcache.AbstractCache.get(AbstractCache.java:465)
                                at com.caucho.server.repository.ClusterRepository.checkForUpdate(ClusterRepository.java:144)
                                at com.caucho.server.repository.ClusterRepository.start(ClusterRepository.java:110)
                                at com.caucho.env.repository.RepositoryService.start(RepositoryService.java:114)
                                at com.caucho.env.service.ResinSystem.startServices(ResinSystem.java:480)
                                at com.caucho.env.service.ResinSystem.start(ResinSystem.java:452)
                                at com.caucho.server.resin.Resin.start(Resin.java:889)
                                at com.caucho.server.resin.Resin.initMain(Resin.java:1017)
                                at com.caucho.server.resin.Resin.main(Resin.java:1284)


At this point, about the only thing we can do is bounce the instance and hope it comes back up cleanly.

This morning, we ran into that problem again. I bounced the instance on all 8 of our webservers and 7 of them came up cleanly. One of them did not.

The watchdog log is showing this:

[2011/01/24 10:48:41.656] {watchdog-wad-c} WatchdogChild[wad-c] starting
[2011/01/24 10:48:41.657] {watchdog-wad-c} Watchdog starting Resin[wad-c]
[2011/01/24 10:49:51.120] {watchdog-wad-c} Watchdog detected close of Resin[wad-c,pid=16232]
                                             exit reason: BAD_CONFIG (exit code=2)
[2011/01/24 10:49:51.120] {watchdog-wad-c} WatchdogChild[wad-c] starting
[2011/01/24 10:49:51.121] {watchdog-wad-c} Watchdog starting Resin[wad-c]

That just repeats as it tries to bring the instance up, decides for whatever reason that it has a BAD_CONFIG, then restarts it again. One thing to note: all 8 servers are configured identically from an operating system and Resin perspective. The resin.xml file is not bad.
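
For context, a minimal sketch of what the cluster portion of that resin.xml might look like. The cluster id, server ids, and the first two addresses are assumptions; only wad-c on 172.27.240.13:6850 is taken from the logs below. In Resin 4 the first three servers of a cluster act as the triad that owns the distributed cache, which appears to be what the aaa/baa/caa hubs in the log messages refer to:

<resin xmlns="http://caucho.com/ns/resin">
  <cluster id="wad">
    <!-- sketch only: the first three servers form the cluster triad -->
    <server id="wad-a" address="172.27.240.11" port="6850"/>  <!-- assumed -->
    <server id="wad-b" address="172.27.240.12" port="6850"/>  <!-- assumed -->
    <server id="wad-c" address="172.27.240.13" port="6850"/>  <!-- from the log -->
    <!-- ... remaining five servers ... -->
  </cluster>
</resin>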

When I turned logging on this server up to "finer", I got roughly 1 MB/sec of log entries relating to "unknown actor" and cluster-cache messages. Here's a sample:


[2011/01/24 10:45:05.726] TcpSocketLink[id=server://172.27.240.13:6850-13,server://172.27.240.13:6850,INIT] starting listen thread
[2011/01/24 10:45:05.726] TcpSocketLink[id=13,wad-c] starting connection TcpSocketLink[id=server://172.27.240.13:6850-13,server://172.27.240.13:6850,ACCEPT], total=0
[2011/01/24 10:45:05.726] TcpSocketLink[id=server://172.27.240.13:6850-13,server://172.27.240.13:6850,ACCEPT] accept
[2011/01/24 10:45:05.726] Hmux[wad-c:13] start request
[2011/01/24 10:45:05.726] Hmux[wad-c:13] 7-r switch-to-hmtp
[2011/01/24 10:45:05.727] TcpSocketLink[id=server://172.27.240.13:6850-14,server://172.27.240.13:6850,INIT] [^] starting listen thread
[2011/01/24 10:45:05.733] HempBroker[caa.wad.admin.resin]: message to unknown actor
                            to: cluster-cache-data@caa.wad.admin.resin
                            from: cluster-cache-data@aaa.wad.admin.resin
                            payload: DataPut[value=62589522]
[2011/01/24 10:45:05.735] HeartbeatActor[cluster-heartbeat@caa.wad.admin.resin] HeartbeatMessage[7] {to:cluster-heartbeat@caa.wad.admin.resin,from:cluster-heartbeat@haa.wad.admin.resin}
[2011/01/24 10:45:05.739] HempBroker[caa.wad.admin.resin]: message to unknown actor
                            to: cluster-cache-data@caa.wad.admin.resin
                            from: cluster-cache-data@baa.wad.admin.resin
                            payload: DataPut[value=3e421d4c]
[2011/01/24 10:45:05.741] HempBroker[caa.wad.admin.resin]: query(16292) to unknown actor to:cluster-cache-data@caa.wad.admin.resin
                            from:cluster-cache-data@baa.wad.admin.resin
                            payload:DataPutTriad[value=62589522]


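For reference, the "finer" level mentioned above is set with <logger> entries in resin.xml; a minimal sketch, with the logger names narrowed to the cache and BAM packages from the stack trace (that narrowing is an assumption):

<!-- sketch: raise only the distributed-cache / BAM packages to finer;
     name="" would raise the root logger and produce far more output -->
<logger name="com.caucho.server.distcache" level="finer"/>
<logger name="com.caucho.cloud.bam" level="finer"/>
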
We eventually get to a point where the following sample is all we see repeating over and over in the log, and the appserver never actually logs that it started:


[2011/01/24 10:56:47.362] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] DataPutTriad[value=8d07bf29] cluster-cache-data@haa.wad.admin.resin DataPutTriad[value=8d07bf29]
[2011/01/24 10:56:47.363] DataStore[data] updateExpires HashKey[8d07bf29]
[2011/01/24 10:56:47.363] DataStore[data] insert HashKey[8d07bf29] length:-1
[2011/01/24 10:56:47.363] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] message ignored DataPut[value=8d07bf29] {from:cluster-cache-data@aaa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
[2011/01/24 10:56:47.363] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] message ignored DataPut[value=8d07bf29] {from:cluster-cache-data@baa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
[2011/01/24 10:56:47.363] cluster-cache@caa put from=cluster-cache@baa.wad.admin.resin CachePut[key=0b436e6e,value=8d07bf29,flags=6,version=13,lease=7,source=null]
[2011/01/24 10:56:47.363] MnodeStore[wad-c] updateSave key=HashKey[0b436e6e] value=HashKey[8d07bf29]
[2011/01/24 10:56:47.364] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] messageError ignored ActorError[type=cancel,group=feature-not-implemented,text=CacheDataTriadActor[cluster-cache-data@aaa.wad.admin.resin]: message is not implemented for this payload
                            DataPut[value=8d07bf29]
                            {from:cluster-cache-data@caa.wad.admin.resin, to:cluster-cache-data@aaa.wad.admin.resin}] cluster-cache-data@aaa.wad.admin.resin {from:cluster-cache-data@aaa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
[2011/01/24 10:56:47.364] CacheDataTriadActor[cluster-cache-data@caa.wad.admin.resin] messageError ignored ActorError[type=cancel,group=feature-not-implemented,text=CacheDataTriadActor[cluster-cache-data@baa.wad.admin.resin]: message is not implemented for this payload
                            DataPut[value=8d07bf29]
                            {from:cluster-cache-data@caa.wad.admin.resin, to:cluster-cache-data@baa.wad.admin.resin}] cluster-cache-data@baa.wad.admin.resin {from:cluster-cache-data@baa.wad.admin.resin, to:cluster-cache-data@caa.wad.admin.resin}
[2011/01/24 10:56:47.448] com.caucho.bam.TimeoutException: com.caucho.cloud.bam.TriadFuture@31f32e51 query timeout: 60000
                            to: cluster-cache@caa.wad.admin.resin
                            query: CacheGet[key=e89bd3ca,version=0,server=2]



Do you have any ideas as to what may be causing this? We currently have always-save-session and always-load-session enabled while we troubleshoot session issues. Is it possible that we're generating more data than Resin can reasonably handle? (Seems unlikely, but...)
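
For reference, a minimal sketch of how those two session flags are typically enabled in a web-app's resin-web.xml; the surrounding elements are assumed, and only the two flags themselves come from the report above:

<web-app xmlns="http://caucho.com/ns/resin">
  <session-config>
    <!-- write the session back to the store after every request -->
    <always-save-session>true</always-save-session>
    <!-- reload the session from the store on every request -->
    <always-load-session>true</always-load-session>
  </session-config>
</web-app>

Both flags add a round trip to the triad's distributed session store on each request, which is presumably the data-volume concern raised above.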

There are no notes attached to this issue.

- Issue History
Date Modified Username Field Change
01-24-11 09:19 ferg New Issue
01-24-11 09:35 jnovak Issue Monitored: jnovak
02-11-11 16:17 ferg Assigned To  => ferg
02-11-11 16:17 ferg Status new => closed
02-11-11 16:17 ferg Resolution open => fixed
02-11-11 16:17 ferg Fixed in Version  => 4.0.15

