Mantis - Resin
Viewing Issue Advanced Details
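ID: 3877
Severity: minor
Reproducibility: always
Date Submitted: 02-04-10 18:00
Last Update: 02-22-10 12:00
Reporter: alex
Assigned To: ferg
Priority: normal
Status: closed
Product Version: 4.0.3
Resolution: fixed
Projection: none
ETA: none
Fixed in Version: 4.0.4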
0003877: Uneven distribution of requests across a cluster with dead nodes
Configuration:
  Mac OS X, dual CPU
  cluster: a, b, c, d, e, f, g
  inactive-nodes: a, d
  apache: 2.2.14
     11 processes started
  10000 requests issued

Expected Results - even distribution (2000 requests each)

Actual Results:

a 0 - node is down
b 2799
c 1439
d 0 - node is down
e 2895
f 1456
g 1411

Notes
(0004415)
alex   
02-04-10 18:07   
It appears that each thread/process has its own copy of the cluster, so active_count on a particular srun never tracks the total active_socket count.

With the cost at 0 for every srun, the node following the failed nodes gets selected at a rate proportional to the number of preceding dead nodes.

With a and b down, server c picks up the shares of a and b and serves triple the load:
a 0
b 0
c 4298
d 1444
e 1424
f 1413
g 1421
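
The skew described above can be reproduced with a small model. This is only a sketch, not the mod_caucho code: it assumes every request starts at a rotating slot among all 7 configured nodes and, since all costs tie at 0, falls through to the first live node after that slot. The simulate function and its parameters are hypothetical names used for illustration.

```python
def simulate(nodes, dead, requests):
    """Model only: rotate the starting slot per request; a dead slot
    falls through to the next live node (all costs assumed equal)."""
    counts = {name: 0 for name in nodes}
    for i in range(requests):
        start = i % len(nodes)
        for step in range(len(nodes)):
            candidate = nodes[(start + step) % len(nodes)]
            if candidate not in dead:
                counts[candidate] += 1
                break
    return counts

nodes = list("abcdefg")
first = simulate(nodes, {"a", "d"}, 10000)   # original report: a and d down
second = simulate(nodes, {"a", "b"}, 10000)  # this note: a and b down
print(first)
print(second)
```

Under these assumptions the model gives b and e roughly 2/7 of the requests in the first case and c roughly 3/7 in the second, which is close to the counts reported here.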

(0004418)
ferg   
02-05-10 09:03   
The backup calculation was using the old 3.1 session encoding, and needed to be updated to the 4.0 encoding.
(0004421)
alex   
02-09-10 09:34   
Retested the case with a build off the trunk:
debian-5-64-bit
apache 2.2.14

The problem appears to be in the select_host code, where active_sockets is invariably 0. All servers therefore have equal cost, and the next node after the failed ones takes their load.

a 0
b 0
c 2164
d 726
e 713
f 720
g 724
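
A quick check of the counts above, as a sketch: it takes a and b as the failed nodes (they served 0 requests) and compares c against the average of the other live nodes. The variable names are illustrative only.

```python
# Counts copied from the retest above.
retest = {"a": 0, "b": 0, "c": 2164, "d": 726, "e": 713, "f": 720, "g": 724}
dead = {"a", "b"}  # assumption: the two nodes that served nothing

# Live nodes other than c, which sits right after the dead ones.
others = [count for name, count in retest.items()
          if name not in dead and name != "c"]
baseline = sum(others) / len(others)
ratio = retest["c"] / baseline
print(f"c served {ratio:.2f}x the average of the other live nodes")
```

The ratio comes out at about 3, consistent with c taking the shares of a and b on top of its own.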
(0004446)
alex   
02-22-10 12:00   
Fix verified with Resin 4.0.4 and Resin 3.1.10.