[CHiPPS-tickets] [COIN-OR High-Performance Parallel Search] #23: MPI BLIS hangs during termination check...
COIN-OR High-Performance Parallel Search
coin-trac at coin-or.org
Tue Jan 27 15:47:01 EST 2009
#23: MPI BLIS hangs during termination check...
-----------------------+----------------------------------------------------
Reporter: nedwards | Owner: yanxu
Type: defect | Status: new
Priority: major | Component: ALPS
Version: stable/0.9 | Resolution:
Keywords: |
-----------------------+----------------------------------------------------
Comment (by nedwards):
OK. I believe I have identified the bug.
The termination check hung because the master was waiting for a message
from the wrong incumbentID (the node holding the best solution value).
This happens because, at some point during the incumbent value
distribution (more on this below), a worker unpacks a bogus, very negative
value from its receive buffer.
The race condition is in the code that sends the incumbent value and ID
information, which is broadcast in tree fashion among the workers. A
single send buffer is packed with the value and ID, and then non-blocking
sends are initiated to the left and right children of the current node.
Later, when further communication to the left or right child is needed,
only the corresponding (left or right) non-blocking send is checked for
completion, even though both sends use the same packed buffer.
If one send (say the left) completes and the next left communication is
allowed to proceed before the right send has completed, the shared buffer
is overwritten with the new left message while the right send may still
be reading from it.
My current solution is to add:
{{{
MPI_Status sentStatus;
MPI_Wait(&forwardRequestL_, &sentStatus);
MPI_Wait(&forwardRequestR_, &sentStatus);
}}}
at the end of the function AlpsKnowledgeBrokerMPI::sendIncumbent() in
AlpsKnowledgeBrokerMPI.cpp, though this probably introduces some overly
conservative process synchronization.
I don't know if there are other places where this bug manifests itself.
Let me know if the above description isn't clear.
- n
--
Ticket URL: <https://projects.coin-or.org/CHiPPS/ticket/23#comment:1>
COIN-OR High-Performance Parallel Search <http://projects.coin-or.org/CHiPPS>
A framework for data-intensive tree-search algorithms.