Mantis - Resin
2558 minor always 03-27-08 08:14 06-12-08 09:35
ferg  
ferg  
normal  
closed 3.1.3  
no change required  
none    
none  
0002558: cluster store corruption issue
(rep by Andrew Fritz)


Both of the servers in our cluster stopped responding at the same time
and Java started using 100% of all CPU resources. Upon killing one
server, the other began responding almost immediately. Restarting the
dead server resulted in MANY exceptions (all roughly the same):

[09:12:53.567] java.lang.IllegalStateException: Can't yet support data over 64M
[09:12:53.567] at com.caucho.db.store.Inode.readFragmentAddr(Inode.java:972)
[09:12:53.567] at com.caucho.db.store.Inode.remove(Inode.java:832)
[09:12:53.567] at com.caucho.db.store.Transaction.writeData(Transaction.java:568)
[09:12:53.567] at com.caucho.db.sql.QueryContext.unlock(QueryContext.java:517)
[09:12:53.567] at com.caucho.db.sql.RowIterateExpr.nextBlock(RowIterateExpr.java:86)
[09:12:53.567] at com.caucho.db.sql.Query.nextBlock(Query.java:713)
[09:12:53.567] at com.caucho.db.sql.Query.nextTuple(Query.java:690)
[09:12:53.567] at com.caucho.db.sql.DeleteQuery.execute(DeleteQuery.java:81)
[09:12:53.567] at com.caucho.db.jdbc.PreparedStatementImpl.execute(PreparedStatementImpl.java:345)
[09:12:53.567] at com.caucho.db.jdbc.PreparedStatementImpl.executeUpdate(PreparedStatementImpl.java:313)
[09:12:53.567] at com.caucho.server.cluster.FileBacking.clearOldObjects(FileBacking.java:260)
[09:12:53.567] at com.caucho.server.cluster.ClusterStore.clearOldObjects(ClusterStore.java:358)
[09:12:53.567] at com.caucho.server.cluster.StoreManager.handleAlarm(StoreManager.java:637)
[09:12:53.567] at com.caucho.server.cluster.StoreManager.start(StoreManager.java:386)
[09:12:53.567] at com.caucho.server.cluster.ClusterStore.start(ClusterStore.java:196)
[09:12:53.567] at com.caucho.server.cluster.Cluster.environmentStart(Cluster.java:928)
[09:12:53.567] at com.caucho.loader.EnvironmentClassLoader.start(EnvironmentClassLoader.java:475)
[09:12:53.567] at com.caucho.server.cluster.Server.start(Server.java:1149)
[09:12:53.567] at com.caucho.server.cluster.Cluster.startServer(Cluster.java:719)
[09:12:53.567] at com.caucho.server.cluster.ClusterServer.startServer(ClusterServer.java:455)
[09:12:53.567] at com.caucho.server.resin.Resin.start(Resin.java:694)
[09:12:53.567] at com.caucho.server.resin.Resin.initMain(Resin.java:1114)
[09:12:53.567] at com.caucho.server.resin.Resin.main(Resin.java:1316)

This exception appeared many times, but everything appears to be working
again. I found one reference suggesting this is cluster store
corruption, possibly related to locking issues. Since our fence came down
(allowing public access, vs beta group only access) spiders have been
hitting our site pretty hard, which could result in a lot more lock
contention (several thousand hits on a server in rapid succession). Not
sure if this might be related.

Any idea what the root cause of this hang was, or what I can do to
prevent it in the future?


Notes
(0003185)
ferg   
06-12-08 09:35   
In this case, the server crashed unexpectedly, corrupting some of the session backing store, i.e. it's not a general corruption issue. The exceptions are due to the cleanup/validation phase of the session on server start.

In the future, we may want to improve the error messages and/or improve the startup validation phase, but the current cleanup is doing its job (if a bit noisily).

The 100% CPU usage would be a different issue, and we would need a profile or thread dump to track it down.
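
For reference, a minimal sketch of how a thread dump with per-thread CPU time could be captured from inside the JVM, using the standard java.lang.management API. The class name HotThreadDump and the dump() method are hypothetical, made up for illustration; nothing here is part of Resin. It only shows the kind of data that would help narrow down which threads are spinning.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Resin: reports live threads ordered by
// the CPU time they have consumed, together with their stack traces.
public class HotThreadDump {

    /** Returns a text report of up to maxThreads threads, busiest first. */
    public static String dump(int maxThreads) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        // CPU timing is optional on some JVMs; fall back to -1 if unavailable.
        boolean cpuAvailable =
            mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled();

        // Snapshot all live threads with full stack traces
        // (no monitor or synchronizer details requested).
        List<ThreadInfo> threads = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null) {
                threads.add(info);
            }
        }

        // Record the CPU time (nanoseconds) for each thread id.
        final Map<Long, Long> cpuNanos = new HashMap<>();
        for (ThreadInfo info : threads) {
            long id = info.getThreadId();
            cpuNanos.put(id, cpuAvailable ? mx.getThreadCpuTime(id) : -1L);
        }

        // Sort descending by CPU time so the likely culprits come first.
        threads.sort((a, b) -> Long.compare(
            cpuNanos.get(b.getThreadId()), cpuNanos.get(a.getThreadId())));

        StringBuilder out = new StringBuilder();
        int count = Math.min(maxThreads, threads.size());
        for (int i = 0; i < count; i++) {
            ThreadInfo info = threads.get(i);
            out.append('"').append(info.getThreadName()).append('"')
               .append(" state=").append(info.getThreadState())
               .append(" cpuMillis=")
               .append(cpuNanos.get(info.getThreadId()) / 1_000_000L)
               .append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump(10));
    }
}
```

Taking two such reports a few seconds apart and comparing the cpuMillis values shows which threads are actually burning CPU during the hang. If the server is too unresponsive to run code in-process, an external dump (for example jstack against the process id, or kill -3 on a HotSpot JVM) gives the stack traces without the CPU figures.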