0002558: cluster store corruption issue

ID	0002558
Category	[Resin]
Severity	minor
Reproducibility	always
Date Submitted	03-27-08 08:14
Last Update	06-12-08 09:35
Reporter	ferg
View Status	public
Assigned To	ferg
Priority	normal
Resolution	no change required
Status	closed
Product Version	3.1.3

Summary

0002558: cluster store corruption issue

Description

(reported by Andrew Fritz)

Both of the servers in our cluster stopped responding at the same time
and Java started using 100% of all CPU resources. Upon killing one
server, the other began responding almost immediately. Restarting the
dead server resulted in MANY exceptions (all roughly the same):

[09:12:53.567] java.lang.IllegalStateException: Can't yet support data over 64M
[09:12:53.567] at com.caucho.db.store.Inode.readFragmentAddr(Inode.java:972)
[09:12:53.567] at com.caucho.db.store.Inode.remove(Inode.java:832)
[09:12:53.567] at com.caucho.db.store.Transaction.writeData(Transaction.java:568)
[09:12:53.567] at com.caucho.db.sql.QueryContext.unlock(QueryContext.java:517)
[09:12:53.567] at com.caucho.db.sql.RowIterateExpr.nextBlock(RowIterateExpr.java:86)
[09:12:53.567] at com.caucho.db.sql.Query.nextBlock(Query.java:713)
[09:12:53.567] at com.caucho.db.sql.Query.nextTuple(Query.java:690)
[09:12:53.567] at com.caucho.db.sql.DeleteQuery.execute(DeleteQuery.java:81)
[09:12:53.567] at com.caucho.db.jdbc.PreparedStatementImpl.execute(PreparedStatementImpl.java:345)
[09:12:53.567] at com.caucho.db.jdbc.PreparedStatementImpl.executeUpdate(PreparedStatementImpl.java:313)
[09:12:53.567] at com.caucho.server.cluster.FileBacking.clearOldObjects(FileBacking.java:260)
[09:12:53.567] at com.caucho.server.cluster.ClusterStore.clearOldObjects(ClusterStore.java:358)
[09:12:53.567] at com.caucho.server.cluster.StoreManager.handleAlarm(StoreManager.java:637)
[09:12:53.567] at com.caucho.server.cluster.StoreManager.start(StoreManager.java:386)
[09:12:53.567] at com.caucho.server.cluster.ClusterStore.start(ClusterStore.java:196)
[09:12:53.567] at com.caucho.server.cluster.Cluster.environmentStart(Cluster.java:928)
[09:12:53.567] at com.caucho.loader.EnvironmentClassLoader.start(EnvironmentClassLoader.java:475)
[09:12:53.567] at com.caucho.server.cluster.Server.start(Server.java:1149)
[09:12:53.567] at com.caucho.server.cluster.Cluster.startServer(Cluster.java:719)
[09:12:53.567] at com.caucho.server.cluster.ClusterServer.startServer(ClusterServer.java:455)
[09:12:53.567] at com.caucho.server.resin.Resin.start(Resin.java:694)
[09:12:53.567] at com.caucho.server.resin.Resin.initMain(Resin.java:1114)
[09:12:53.567] at com.caucho.server.resin.Resin.main(Resin.java:1316)

This exception appeared many times, but everything appears to be working
again. I found one reference suggesting this is cluster store
corruption, possibly related to locking issues. Since our fence came down
(allowing public access instead of beta-group-only access), spiders have
been hitting our site pretty hard, which could result in a lot more lock
contention (several thousand hits on a server in rapid succession). Not
sure if this might be related.

Any idea what the root cause of this hang up was, or what I can do to
prevent it in the future?

Notes
(0003185) ferg 06-12-08 09:35	In this case, the server crashed unexpectedly, corrupting some of the session backing store, so it's not a general corruption issue. The exceptions come from the cleanup/validation phase of the session store on server start. In the future, we may want to improve the error messages and/or the startup validation phase, but the current cleanup is doing its job (if a bit noisily). The 100% CPU usage would be a different issue, and would need a profile/thread dump to track down.
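
For anyone hitting the 100% CPU symptom again, the thread dump mentioned in the note can be captured with the JDK's `jstack <pid>` tool or by sending SIGQUIT (`kill -3 <pid>`) to the JVM. As a rough in-process alternative, the sketch below uses the standard `java.lang.management.ThreadMXBean` API to list every live thread with its state, CPU time, and stack. The class name `ThreadDumpSketch` and the output layout are illustrative assumptions, not part of Resin, and the code would have to run inside the affected JVM (for example from a diagnostic servlet) to be useful.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper (not part of Resin): dumps all live threads with CPU time.
public class ThreadDumpSketch {

    /** Returns a text dump of every live thread: name, state, CPU time, stack. */
    public static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // CPU timing is optional in the JVM spec, so check before using it.
        boolean cpu = mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled();
        StringBuilder sb = new StringBuilder();
        // false, false: skip locked monitor/synchronizer details to keep it cheap.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            long cpuNanos = cpu ? mx.getThreadCpuTime(info.getThreadId()) : -1L;
            sb.append('"').append(info.getThreadName()).append('"')
              .append(" state=").append(info.getThreadState())
              .append(" cpuMs=")
              .append(cpuNanos < 0 ? "n/a" : String.valueOf(cpuNanos / 1000000L))
              .append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Taking two or three dumps a few seconds apart and comparing the `cpuMs` values is one way to spot which threads are actually burning CPU, as opposed to merely being runnable.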

Issue History
Date Modified	Username	Field	Change
03-27-08 08:14	ferg	New Issue
06-12-08 09:35	ferg	Note Added: 0003185
06-12-08 09:35	ferg	Assigned To	=> ferg
06-12-08 09:35	ferg	Status	new => closed
06-12-08 09:35	ferg	Resolution	open => no change required
