Mantis - Quercus
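ID: 1194
Severity: major
Reproducibility: sometimes
Date Submitted: 06-12-06 15:15
Last Update: 06-28-06 11:47
Reporter: slamb
Assigned To: ferg
Priority: normal
Status: closed
Product Version: 3.0.19
Resolution: fixed
Projection: none
ETA: none
Fixed in Version: 3.0.20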
0001194: Two running wrapper.pls after restart
Encountered this on Resin 2.1.17; looks like wrapper.pl in 3.0.19 is essentially the same.

Resin doesn't seem to completely let go of the logfiles unless I completely restart it. Thus, I do so in logrotate nightly. (Perhaps that's a separate bug, and I haven't checked if it applies to 3.0.)

In any case, sometimes after a restart there are two wrapper.pl processes running. One of them starts Java, which finds that it can't bind to the port and dies. That wrapper then starts Java again, and again, and again. The result is horrible VM and CPU spikes from bringing up a JVM over and over for nothing.

I added some syslog calls to diagnose this and found that the following happens about once a week:

1. wrapper.pl B starts, SIGTERMs wrapper.pl A, creates a pidfile, and then wrapper.pl A unlinks it.
2. wrapper.pl C starts and has no way of knowing about wrapper.pl B. Both continue to run, with nasty consequences.
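
The sequence above can be replayed deterministically in a few lines of Perl. This is only an illustration: the pid values, the temporary pidfile path, and the helpers write_pidfile and read_pidfile are made up for the example and are not taken from wrapper.pl, and no signals are actually sent.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical pidfile location; the real wrapper.pl path is not assumed here.
my $dir     = tempdir(CLEANUP => 1);
my $pidfile = "$dir/resin.pid";

# Hypothetical stand-in for a wrapper recording its own pid.
sub write_pidfile {
    my ($pid) = @_;
    open(my $fh, '>', $pidfile) or die "cannot write $pidfile: $!";
    print {$fh} "$pid\n";
    close($fh) or die "cannot close $pidfile: $!";
}

# Hypothetical stand-in for a new wrapper looking for an old one.
sub read_pidfile {
    open(my $fh, '<', $pidfile) or return;
    my $pid = <$fh>;
    close($fh);
    chomp $pid if defined $pid;
    return $pid;
}

write_pidfile(100);                 # wrapper A (pid 100) is running
my $seen_by_b = read_pidfile();     # wrapper B starts and finds A's pid
                                    # (B would SIGTERM pid 100 here)
write_pidfile(200);                 # B records itself as pid 200
unlink($pidfile);                   # A's cleanup runs late and removes B's pidfile
my $seen_by_c = read_pidfile();     # wrapper C starts and finds nothing to kill

printf "B saw pid %s; C saw %s\n",
    $seen_by_b, defined $seen_by_c ? "pid $seen_by_c" : 'no pidfile';
```

At the end B is still running but is invisible to C, which is exactly the state in step 2.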

I found this comment in wrapper.pl:

    # unlink needs to happen relatively soon so restart's pid won't
    # get unlinked

That's inadequate. There's still a race, and it's still sometimes being lost.

My solution is for the new wrapper not to create its pidfile until it knows the old process is dead and therefore can no longer unlink anything. Something like this:

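        # Signal the old wrapper until kill() reports that nothing was signaled.
        my $signal = 15;                # first attempt: SIGTERM
        while (kill($signal, $pid)) {
            syslog('info', "pid $$ delivered signal $signal to $pid (old wrapper)");
            sleep(1);
            $signal = 9;                # every later attempt: SIGKILL
        }
        # Only after this point should the new pidfile be created.
        syslog('info', "pid $$: old wrapper $pid should be dead now");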
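
A slightly more defensive variant of the loop above, as a sketch only. The helper name stop_old_wrapper and its $grace and $max_tries parameters are hypothetical and not part of wrapper.pl. It gives SIGTERM a few seconds before escalating to SIGKILL, treats only ESRCH as "the old wrapper is gone", and gives up after a bounded number of tries instead of looping forever.

```perl
use strict;
use warnings;
use Errno qw(ESRCH);

# Hypothetical helper, not taken from wrapper.pl.
# Returns 1 once $pid no longer exists, 0 if it was still being
# signaled successfully after $max_tries attempts.
sub stop_old_wrapper {
    my ($pid, $grace, $max_tries) = @_;
    for my $try (1 .. $max_tries) {
        # SIGTERM for the first $grace attempts, SIGKILL afterwards.
        my $signal = $try <= $grace ? 'TERM' : 'KILL';
        if (!kill($signal, $pid)) {
            return 1 if $! == ESRCH;       # no such process: safe to proceed
            die "cannot signal $pid: $!";  # e.g. EPERM: not ours to kill
        }
        sleep(1);                          # give the old wrapper time to exit
    }
    return 0;   # conservative: caller should not create the pidfile yet
}

# Usage sketch (the numbers and $old_pid are placeholders):
#   stop_old_wrapper($old_pid, 5, 10)
#       or die "old wrapper $old_pid will not die";
#   # only now create the new pidfile for $$
```

Failing on errors other than ESRCH matters because kill() also returns 0 when the signal is not permitted, and in that case the old wrapper may still be alive and may still unlink the pidfile.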

There are no notes attached to this issue.