Mantis - Resin
Viewing Issue Advanced Details
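ID: 5286
Severity: block
Reproducibility: random
Date Submitted: 11-21-12 02:33
Last Update: 01-09-13 12:10
Reporter: paul
Assigned To: ferg
Priority: normal
Status: closed
Product Version: 3.1.9
Resolution: fixed
Projection: none
ETA: none
Fixed in Version: 4.0.32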
0005286: Threads can become 'stuck' forever
I have noticed a few times that the thread count on one server (in a 3-server cluster) can become 'stuck'. Looking through resin.log reveals this error message:

[2012-11-09 04:56:41.419] java.io.IOException: failed to add EPOLL for fd=512
[2012-11-09 04:56:41.419]
[2012-11-09 04:56:41.419] at com.caucho.server.port.JniSelectManager.removeNative(Native Method)
[2012-11-09 04:56:41.419] at com.caucho.server.port.JniSelectManager.remove(JniSelectManager.java:476)
[2012-11-09 04:56:41.419] at com.caucho.server.port.JniSelectManager.run(JniSelectManager.java:376)
[2012-11-09 04:56:41.419] at java.lang.Thread.run(Thread.java:662)


Taking a thread dump reveals many threads (sometimes up to 200), all with a stack trace similar to:

"hmux-192.168.0.3:6800-20399$1859226681" - Thread t@40
   java.lang.Thread.State: RUNNABLE
    at com.caucho.server.port.JniSelectManager.addNative(Native Method)
    at com.caucho.server.port.JniSelectManager.keepalive(JniSelectManager.java:229)
    at com.caucho.server.port.TcpConnection.keepalive(TcpConnection.java:448)
    at com.caucho.server.port.TcpConnection.run(TcpConnection.java:739)
    at com.caucho.util.ThreadPool$Item.runTasks(ThreadPool.java:743)
    at com.caucho.util.ThreadPool$Item.run(ThreadPool.java:662)
    at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
    - None

Looking at the native code, my guess is that the failure causes the method to exit without first releasing the mutex it acquired, which leaves every subsequent caller stuck waiting for the same mutex to become available.
I checked the JNI source for 3.1.9Pro, 3.1.12Pro and 3.1.13Pro, and all appear to be the same in this respect.
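
To illustrate the suspected pattern, here is a minimal sketch in C. It is not Resin's JNI source: select_lock, register_fd, add_fd_leaky, add_fd_safe and lock_is_held are all hypothetical names invented for this example, and register_fd merely pretends that registration fails for a negative descriptor. It only shows how an early return on an error path can leave a pthread mutex held, and how a single exit path avoids that.

```c
#include <pthread.h>

/* Hypothetical stand-in for the lock guarding the native select state. */
static pthread_mutex_t select_lock = PTHREAD_MUTEX_INITIALIZER;

/* Pretend registration step: fails for negative descriptors (assumption). */
static int register_fd(int fd)
{
    return fd < 0 ? -1 : 0;
}

/* Suspected buggy pattern: the error path returns while holding the lock. */
static int add_fd_leaky(int fd)
{
    pthread_mutex_lock(&select_lock);
    if (register_fd(fd) < 0)
        return -1;                      /* select_lock is never released */
    pthread_mutex_unlock(&select_lock);
    return 0;
}

/* Safer pattern: one exit path that always releases the lock. */
static int add_fd_safe(int fd)
{
    int rc;

    pthread_mutex_lock(&select_lock);
    rc = register_fd(fd);
    pthread_mutex_unlock(&select_lock);
    return rc;
}

/* Probe without blocking: returns 1 if the lock is held, 0 if it is free. */
static int lock_is_held(void)
{
    if (pthread_mutex_trylock(&select_lock) == 0) {
        pthread_mutex_unlock(&select_lock);
        return 0;
    }
    return 1;
}
```

After a failing call to add_fd_safe the lock is free again, whereas after a failing call to add_fd_leaky it stays held, so any later caller that blocks in pthread_mutex_lock would wait indefinitely. That would be consistent with the threads piling up in addNative in the dump above, though this remains a guess.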

There are no notes attached to this issue.