Mantis Bugtracker
  

Viewing Issue Simple Details
ID: 0002558
Category: [Resin]
Severity: minor
Reproducibility: always
Date Submitted: 03-27-08 08:14
Last Update: 06-12-08 09:35
Reporter: ferg
View Status: public
Assigned To: ferg
Priority: normal
Resolution: no change required
Status: closed
Product Version: 3.1.3
Summary: 0002558: cluster store corruption issue
Description (rep by Andrew Fritz)


Both of the servers in our cluster stopped responding at the same time
and Java started using 100% of all CPU resources. Upon killing one
server, the other began responding almost immediately. Restarting the
dead server resulted in MANY exceptions (all roughly the same):

[09:12:53.567] java.lang.IllegalStateException: Can't yet support data over 64M
[09:12:53.567] at com.caucho.db.store.Inode.readFragmentAddr(Inode.java:972)
[09:12:53.567] at com.caucho.db.store.Inode.remove(Inode.java:832)
[09:12:53.567] at com.caucho.db.store.Transaction.writeData(Transaction.java:568)
[09:12:53.567] at com.caucho.db.sql.QueryContext.unlock(QueryContext.java:517)
[09:12:53.567] at com.caucho.db.sql.RowIterateExpr.nextBlock(RowIterateExpr.java:86)
[09:12:53.567] at com.caucho.db.sql.Query.nextBlock(Query.java:713)
[09:12:53.567] at com.caucho.db.sql.Query.nextTuple(Query.java:690)
[09:12:53.567] at com.caucho.db.sql.DeleteQuery.execute(DeleteQuery.java:81)
[09:12:53.567] at com.caucho.db.jdbc.PreparedStatementImpl.execute(PreparedStatementImpl.java:345)
[09:12:53.567] at com.caucho.db.jdbc.PreparedStatementImpl.executeUpdate(PreparedStatementImpl.java:313)
[09:12:53.567] at com.caucho.server.cluster.FileBacking.clearOldObjects(FileBacking.java:260)
[09:12:53.567] at com.caucho.server.cluster.ClusterStore.clearOldObjects(ClusterStore.java:358)
[09:12:53.567] at com.caucho.server.cluster.StoreManager.handleAlarm(StoreManager.java:637)
[09:12:53.567] at com.caucho.server.cluster.StoreManager.start(StoreManager.java:386)
[09:12:53.567] at com.caucho.server.cluster.ClusterStore.start(ClusterStore.java:196)
[09:12:53.567] at com.caucho.server.cluster.Cluster.environmentStart(Cluster.java:928)
[09:12:53.567] at com.caucho.loader.EnvironmentClassLoader.start(EnvironmentClassLoader.java:475)
[09:12:53.567] at com.caucho.server.cluster.Server.start(Server.java:1149)
[09:12:53.567] at com.caucho.server.cluster.Cluster.startServer(Cluster.java:719)
[09:12:53.567] at com.caucho.server.cluster.ClusterServer.startServer(ClusterServer.java:455)
[09:12:53.567] at com.caucho.server.resin.Resin.start(Resin.java:694)
[09:12:53.567] at com.caucho.server.resin.Resin.initMain(Resin.java:1114)
[09:12:53.567] at com.caucho.server.resin.Resin.main(Resin.java:1316)

This exception appeared many times, but everything appears to be working
again. I found one reference suggesting this is cluster store
corruption, possibly related to locking issues. Since our fence came down
(allowing public access instead of beta-group-only access), spiders have been
hitting our site pretty hard, which could result in a lot more lock
contention (several thousand hits on a server in rapid succession). Not sure
if this might be related.

Any idea what the root cause of this hang was, or what I can do to
prevent it in the future?


- Notes
(0003185)
ferg
06-12-08 09:35

In this case, the server crashed unexpectedly, corrupting part of the session backing store; it is not a general corruption issue. The exceptions come from the cleanup/validation phase of the session on server start.

In the future, we may want to improve the error messages and/or improve the startup validation phase, but the current cleanup is doing its job (if a bit noisily).

The 100% CPU usage would be a different issue, and tracking it down would need a profile or thread dump.
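
For reference, one way to collect the kind of thread dump mentioned above is the standard java.lang.management API. The sketch below is only an illustration: the class name ThreadDumpSketch and the output format are made up for this example and are not part of Resin. It lists every live thread with its state, its CPU time (where the JVM reports it), and its stack, which is usually enough to spot a thread spinning at 100% CPU.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Resin: prints state, CPU time and stack per thread.
public class ThreadDumpSketch {

    /** Returns a text dump of all live threads in this JVM. */
    public static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Per-thread CPU time is optional; fall back to "n/a" when unavailable.
        boolean cpuOk = mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled();
        StringBuilder sb = new StringBuilder();

        // false, false: skip locked-monitor and synchronizer details.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            long cpuNanos = cpuOk ? mx.getThreadCpuTime(info.getThreadId()) : -1L;
            sb.append('"').append(info.getThreadName()).append('"')
              .append(" state=").append(info.getThreadState())
              .append(" cpuMs=")
              .append(cpuNanos < 0 ? "n/a" : String.valueOf(cpuNanos / 1000000L))
              .append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

This only works from inside the affected JVM and needs a thread that can still run. If the server is completely unresponsive, an external dump is likely more practical, for example with the JDK's jstack tool against the process id, or by sending SIGQUIT (kill -3) to the process on Unix. Taking two or three dumps a few seconds apart and comparing the cpuMs values shows which threads are actually burning CPU.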
 

- Issue History
Date Modified Username Field Change
03-27-08 08:14 ferg New Issue
06-12-08 09:35 ferg Note Added: 0003185
06-12-08 09:35 ferg Assigned To  => ferg
06-12-08 09:35 ferg Status new => closed
06-12-08 09:35 ferg Resolution open => no change required


Mantis 1.0.0rc3
Copyright © 2000 - 2005 Mantis Group