0002558: cluster store corruption issue

Mantis - Resin

ID:	Category:	Severity:	Reproducibility:	Date Submitted:	Last Update:
2558		minor	always	03-27-08 08:14	06-12-08 09:35

Reporter:	ferg	Platform:
Assigned To:	ferg	OS:
Priority:	normal	OS Version:
Status:	closed	Product Version:	3.1.3
Product Build:		Resolution:	no change required
Projection:	none
ETA:	none	Fixed in Version:

Summary:	0002558: cluster store corruption issue
Description:	(reported by Andrew Fritz) Both of the servers in our cluster stopped responding at the same time, and Java started using 100% of all CPU resources. Upon killing one server, the other began responding almost immediately. Restarting the dead server resulted in MANY exceptions (all roughly the same):
[09:12:53.567] java.lang.IllegalStateException: Can't yet support data over 64M
[09:12:53.567] at com.caucho.db.store.Inode.readFragmentAddr(Inode.java:972)
[09:12:53.567] at com.caucho.db.store.Inode.remove(Inode.java:832)
[09:12:53.567] at com.caucho.db.store.Transaction.writeData(Transaction.java:568)
[09:12:53.567] at com.caucho.db.sql.QueryContext.unlock(QueryContext.java:517)
[09:12:53.567] at com.caucho.db.sql.RowIterateExpr.nextBlock(RowIterateExpr.java:86)
[09:12:53.567] at com.caucho.db.sql.Query.nextBlock(Query.java:713)
[09:12:53.567] at com.caucho.db.sql.Query.nextTuple(Query.java:690)
[09:12:53.567] at com.caucho.db.sql.DeleteQuery.execute(DeleteQuery.java:81)
[09:12:53.567] at com.caucho.db.jdbc.PreparedStatementImpl.execute(PreparedStatementImpl.java:345)
[09:12:53.567] at com.caucho.db.jdbc.PreparedStatementImpl.executeUpdate(PreparedStatementImpl.java:313)
[09:12:53.567] at com.caucho.server.cluster.FileBacking.clearOldObjects(FileBacking.java:260)
[09:12:53.567] at com.caucho.server.cluster.ClusterStore.clearOldObjects(ClusterStore.java:358)
[09:12:53.567] at com.caucho.server.cluster.StoreManager.handleAlarm(StoreManager.java:637)
[09:12:53.567] at com.caucho.server.cluster.StoreManager.start(StoreManager.java:386)
[09:12:53.567] at com.caucho.server.cluster.ClusterStore.start(ClusterStore.java:196)
[09:12:53.567] at com.caucho.server.cluster.Cluster.environmentStart(Cluster.java:928)
[09:12:53.567] at com.caucho.loader.EnvironmentClassLoader.start(EnvironmentClassLoader.java:475)
[09:12:53.567] at com.caucho.server.cluster.Server.start(Server.java:1149)
[09:12:53.567] at com.caucho.server.cluster.Cluster.startServer(Cluster.java:719)
[09:12:53.567] at com.caucho.server.cluster.ClusterServer.startServer(ClusterServer.java:455)
[09:12:53.567] at com.caucho.server.resin.Resin.start(Resin.java:694)
[09:12:53.567] at com.caucho.server.resin.Resin.initMain(Resin.java:1114)
[09:12:53.567] at com.caucho.server.resin.Resin.main(Resin.java:1316)
This exception appeared many times, but everything appears to be working again. I found one reference suggesting that this is cluster store corruption, possibly related to locking issues. Since our fence came down (allowing public access instead of beta-group-only access), spiders have been hitting our site pretty hard, which could result in a lot more lock contention (several thousand hits on a server in rapid succession). Not sure if this might be related. Any idea what the root cause of this hang was, or what I can do to prevent it in the future?
Steps To Reproduce:
Additional Information:
Relationships
Attached Files:

Notes