[CHiPPS-tickets] [COIN-OR High-Performance Parallel Search] #23: MPI BLIS hangs during termination check...

COIN-OR High-Performance Parallel Search coin-trac at coin-or.org
Tue Jan 27 15:47:01 EST 2009


#23: MPI BLIS hangs during termination check...
-----------------------+----------------------------------------------------
Reporter:  nedwards    |        Owner:  yanxu
    Type:  defect      |       Status:  new  
Priority:  major       |    Component:  ALPS 
 Version:  stable/0.9  |   Resolution:       
Keywords:              |  
-----------------------+----------------------------------------------------
Comment (by nedwards):

 OK. I believe I have identified the bug.

 The termination check hangs because the master waits for a message from
 the wrong incumbentID (the ID of the node holding the best solution
 value). This happens because, at some point during the distribution of
 the incumbent value (more about this below), a worker unpacks a bogus,
 very negative value from its receive buffer.

 The race condition is in the code that sends the incumbent value and ID,
 which are broadcast in a tree fashion among the workers. A single send
 buffer is packed with the value and ID, and non-blocking sends are then
 initiated to the left and right children of the current node. Later,
 when further communication to the left or right child is needed, only
 that child's non-blocking send is checked for completion, even though
 both sends use the same packed buffer.

 If one send (say the left) completes, the subsequent left communication
 is allowed to proceed before the right send is complete, and the buffer
 is overwritten with the new left message while the right send is still
 in flight.
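
 The ordering described above can be reproduced in a few lines without MPI.
 The sketch below is a toy model only: PendingSend and
 rightChildSeesAfterRace are hypothetical names made up for illustration,
 not ALPS or MPI code. It assumes a "send" that merely borrows a pointer to
 the buffer and reads it at completion time, which mirrors the MPI rule
 that a send buffer must not be modified until its non-blocking send has
 completed.

```cpp
#include <cassert>

// Toy stand-in for a non-blocking send (hypothetical, not MPI): it records
// WHERE the data lives and only reads the value when the transfer completes.
struct PendingSend {
    const double* buf = nullptr;  // borrowed pointer into the caller's buffer
    double delivered = 0.0;       // value the receiver ends up seeing
    bool done = true;

    void start(const double* b) { buf = b; done = false; }
    void complete() { delivered = *buf; done = true; }  // buffer is read here
};

// Replays the ordering from the description and returns the value the right
// child receives for the first incumbent broadcast.
double rightChildSeesAfterRace(double firstIncumbent, double nextLeftMessage) {
    double sendBuf = firstIncumbent;   // single shared send buffer
    PendingSend left, right;
    left.start(&sendBuf);
    right.start(&sendBuf);

    left.complete();                   // the left send finishes first
    assert(left.done && !right.done);  // only the left request is checked

    sendBuf = nextLeftMessage;         // buffer repacked for the next left send
    right.complete();                  // the right send now reads the new contents
    return right.delivered;
}
```

 In this model, rightChildSeesAfterRace(42.0, -1e75) hands the right child
 the later value rather than 42.0, which is the same shape as the bogus
 value seen by the worker.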

 My current solution is to add:
 {{{
 MPI_Status sentStatus;
 MPI_Wait(&forwardRequestL_, &sentStatus);
 MPI_Wait(&forwardRequestR_, &sentStatus);
 }}}
 at the end of the function AlpsKnowledgeBrokerMPI::sendIncumbent() in
 AlpsKnowledgeBrokerMPI.cpp, though this probably introduces some overly
 conservative process synchronization.
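
 One possibly less conservative variant, offered only as an untested idea,
 would be to wait on both requests immediately before the shared buffer is
 repacked instead of immediately after the sends are posted. The sketch
 below shows that ordering in a toy model without MPI. PendingSend,
 repackWhenIdle and rightChildSeesWithGuard are hypothetical names for
 illustration, and the complete() calls stand in for MPI_Wait on
 forwardRequestL_ and forwardRequestR_. Where such a guard would actually
 belong in the ALPS sources is an assumption that I have not checked.

```cpp
#include <cassert>

// Toy stand-in for a non-blocking send (hypothetical, not MPI): it records
// WHERE the data lives and only reads the value when the transfer completes.
struct PendingSend {
    const double* buf = nullptr;  // borrowed pointer into the caller's buffer
    double delivered = 0.0;       // value the receiver ends up seeing
    bool done = true;

    void start(const double* b) { buf = b; done = false; }
    void complete() { delivered = *buf; done = true; }  // buffer is read here
};

// Finish every send that still borrows the buffer, then overwrite it.
void repackWhenIdle(double& sendBuf, PendingSend& left, PendingSend& right,
                    double newValue) {
    if (!left.done) left.complete();    // plays the role of MPI_Wait (left)
    if (!right.done) right.complete();  // plays the role of MPI_Wait (right)
    sendBuf = newValue;                 // safe: no send is reading the buffer
}

// Same ordering as the failing case, but the buffer is repacked via the guard.
double rightChildSeesWithGuard(double firstIncumbent, double nextLeftMessage) {
    double sendBuf = firstIncumbent;    // single shared send buffer
    PendingSend left, right;
    left.start(&sendBuf);
    right.start(&sendBuf);

    left.complete();                    // the left send finishes first
    repackWhenIdle(sendBuf, left, right, nextLeftMessage);
    assert(left.done && right.done);    // both sends completed before reuse
    return right.delivered;
}
```

 With the guard in place, rightChildSeesWithGuard(42.0, -1e75) delivers
 42.0 to the right child, because the right send is completed before the
 buffer changes.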

 I don't know if there are other places where this bug manifests itself.

 Let me know if the above description isn't clear.

 - n

-- 
Ticket URL: <https://projects.coin-or.org/CHiPPS/ticket/23#comment:1>
COIN-OR High-Performance Parallel Search <http://projects.coin-or.org/CHiPPS>
A framework for data-intensive tree-search algorithms.


