[Cops] Questions on ALPS

Yan Xu Yan.Xu at sas.com
Wed Jan 16 12:21:22 EST 2008


Khoa,

"-Alps_masterBalancePeriod 60" specifies that the master balances the workload every 60 nodes (processed);
"-Alps_hubReportPeriod 45" specifies that the hubs report the search state (work quality/quantity, number of messages sent or received, etc.) every 45 nodes (processed).

"-Alps_masterInitNodeNum 200000000" says the master will generate 200000000 nodes for hubs during rampup; and
"-Alps_hubInitNodeNum 40000000" says a hub will generate 40000000 nodes for its workers during rampup.

Note that Alps_masterInitNodeNum and Alps_hubInitNodeNum only take effect when Alps_staticBalanceScheme is set to 'root initialization' (the default is spiral, see AlpsParams.h).

200000000 and 40000000 are way too large. Basically, this means the search will be stuck in rampup and the master does all the search.
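
A toy model may make this concrete. The sketch below is not ALPS code: the names `rampUp` and `RampUpResult`, the binary branching, and the fixed tree size are all assumptions made for illustration. It only mimics the rule described above, where the master keeps processing nodes by itself until it holds the requested number of open nodes or the tree runs out.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of root-initialization ramp-up (hypothetical, not ALPS code).
// The master processes nodes alone until its pool holds at least
// `initNodeNum` open nodes or the whole tree has been explored.
struct RampUpResult {
    std::int64_t processedByMaster;  // nodes the master processed by itself
    std::int64_t openNodes;          // nodes left to hand out afterwards
    bool treeExhausted;              // true if the search ended inside ramp-up
};

RampUpResult rampUp(std::int64_t initNodeNum, std::int64_t treeSize) {
    assert(initNodeNum > 0 && treeSize > 0);
    std::int64_t created = 1;    // the root node
    std::int64_t open = 1;
    std::int64_t processed = 0;
    while (open > 0 && open < initNodeNum) {
        --open;                  // the master processes one node itself
        ++processed;
        // Assumed branching: two children while the tree still has nodes.
        for (int c = 0; c < 2 && created < treeSize; ++c) {
            ++created;
            ++open;
        }
    }
    return {processed, open, open == 0};
}
```

In this model, `rampUp(200000000, 1001)` never reaches the target: the master processes all 1001 nodes and nothing is left for the hubs. With a small target such as `rampUp(8, 1001)` (the value 8 is only an example), the master stops after 7 nodes and has 8 open nodes to distribute.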



"Alps_unitWorkNodes 10000000" specifies that a worker will process 10000000 nodes and then check/handle messages, then process another 10000000 nodes, and so on.

10000000 is too big to be a unit of work. Dynamic load balancing has little chance to kick in.
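
The effect of the unit size can be sketched with a small counting function. This is a toy model, not ALPS code: `countMessageChecks` is a hypothetical name, and the model assumes a worker handles messages only between units of work, as described above.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a worker's main loop (hypothetical, not ALPS code):
// process up to `unitWorkNodes` nodes, then check/handle messages, repeat.
// The return value is the number of message checks, i.e. the number of
// opportunities for dynamic load balancing to reach this worker.
std::int64_t countMessageChecks(std::int64_t totalNodes,
                                std::int64_t unitWorkNodes) {
    assert(totalNodes >= 0 && unitWorkNodes > 0);
    std::int64_t checks = 0;
    std::int64_t remaining = totalNodes;
    while (remaining > 0) {
        const std::int64_t batch =
            remaining < unitWorkNodes ? remaining : unitWorkNodes;
        remaining -= batch;  // process one unit of work
        ++checks;            // then check/handle messages
    }
    return checks;
}
```

For a worker with 1000000 nodes to process, a unit of 10000000 gives a single message check at the very end, while a unit of 100 (an arbitrary example value) gives 10000 checks.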


I would suggest first setting Alps_staticBalanceScheme to 0 (root initialization), leaving all other parameters untouched, and seeing the results.

Is it possible to share the logs?

Yan







-----Original Message-----
From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
Sent: Tuesday, January 15, 2008 12:36 PM
To: Yan Xu
Subject: RE: Questions on ALPS

Hi Yan,

Thanks, now I understand why the program didn't stop. I will set these
initNodeNum values smaller. However, I need to test them to avoid "sleeping"
states when running for more than a few hours.
At the moment I run my program on a cluster of five 4-core machines.
Sorry, I didn't compute the average time for processing a node,
as I tested many instances. I am switching to the bigger cluster of
the university, which is expected to have up to 128 nodes.

Is there any rule or hint for setting these parameters? Changing them
from the defaults to the following values did improve my program (no
sleeping processes in long runs).
 -Alps_masterBalancePeriod 60
 -Alps_hubReportPeriod 45
 -Alps_masterInitNodeNum 200000000
 -Alps_hubInitNodeNum 40000000
 -Alps_unitWorkNodes 10000000

Khoa

-----Original Message-----
From: Yan Xu [mailto:Yan.Xu at sas.com]
Sent: Tuesday, January 15, 2008 3:06 PM
To: Khoa Vo
Subject: RE: Questions on ALPS

Khoa,

If using root initialization and setting Alps_masterInitNodeNum to
200000000, then the master will generate 200000000 nodes for other
processors during rampup. There is a good chance that the master cannot
get past the ramp-up phase. ALPS does not check whether the time limit
has been reached during rampup.
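
A simulated clock shows why the time limit appears to be ignored. The sketch below is a toy model, not ALPS code: `simulateSearch` and its parameters are hypothetical, and each node is assumed to cost a fixed number of seconds. The only property taken from the description above is that the limit is tested in the regular search phase and not during ramp-up.

```cpp
#include <cassert>

// Toy two-phase search with a simulated clock (hypothetical, not ALPS code).
// Every processed node costs `secPerNode` seconds. Returns the simulated
// elapsed time when the search stops.
double simulateSearch(long rampUpNodes, long searchNodes,
                      double secPerNode, double timeLimitSec) {
    assert(rampUpNodes >= 0 && searchNodes >= 0 && secPerNode > 0.0);
    double clock = 0.0;
    // Phase 1: ramp-up. The time limit is not checked here.
    for (long i = 0; i < rampUpNodes; ++i) {
        clock += secPerNode;
    }
    // Phase 2: regular search. The limit is checked before every node.
    for (long i = 0; i < searchNodes && clock < timeLimitSec; ++i) {
        clock += secPerNode;
    }
    return clock;
}
```

With a 10 second limit and 1 second per node, `simulateSearch(5, 100, 1.0, 10.0)` stops at 10 seconds, but `simulateSearch(100, 100, 1.0, 10.0)` runs for 100 seconds because the whole overrun happens inside ramp-up.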

Also the unit work seems very large.

Probably you can make

-Alps_masterInitNodeNum 200000000
-Alps_hubInitNodeNum 40000000
-Alps_unitWorkNodes 10000000

much smaller, or let ALPS choose a value for you (comment out the
parameters)


How many processors are you using? How long does it take to process a
node?

Yan


-----Original Message-----
From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
Sent: Tuesday, January 15, 2008 6:08 AM
To: Yan Xu
Subject: RE: Questions on ALPS

Hi Yan,

Sounds great! Thanks for your fix. I will try it and let you know the
result.

I have another question regarding the time limit. I tried changing
(increasing) the following parameters to avoid "sleeping states" when the
program runs for a long time:
 -Alps_masterBalancePeriod 60
 -Alps_hubReportPeriod 45
 -Alps_masterInitNodeNum 200000000
 -Alps_hubInitNodeNum 40000000
 -Alps_unitWorkNodes 10000000

Then the time limit set with Alps_timeLimit doesn't work anymore. The
program continues running after the time limit is exceeded.
Is this by design, or did I do something wrong?

Thanks,
Khoa

-----Original Message-----
From: Yan Xu [mailto:Yan.Xu at sas.com]
Sent: Monday, January 14, 2008 9:46 PM
To: Khoa Vo
Subject: RE: Questions on ALPS

Khoa,

I think I fixed the solution limit issue. You can use Alps_solLimit to
stop the search. For simplicity, the solution limit is not checked during
rampup. Hopefully, it works fine. The change was committed into trunk;
you can get trunk by

svn checkout https://projects.coin-or.org/svn/CHiPPS/Alps/trunk Alps-trunk


regards,
Yan

-----Original Message-----
From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
Sent: Thursday, January 10, 2008 5:07 AM
To: Yan Xu
Subject: RE: Questions on ALPS


Hi Yan,

Thanks, I am playing around with the parameters.
Let me know when you have the "solution limit" available.

Best,
Khoa

Yan Xu <Yan.Xu at sas.com> said:

> Khoa,
>
> Wish you a happy new year too!
>
>
> https://projects.coin-or.org/CHiPPS/browser/Alps/trunk/Alps/src/AlpsParams.h
>
> has some explanation of the ALPS parameters.
>
> Alps_needWorkThreshold: a worker will ask for subtrees/nodes if its workload
> is below the value specified by this parameter.
> Alps_changeWorkThreshold: a worker will quit working on the current subtree
> if its quality is worse than the best one in the subtree pool by a certain
> amount.
> Alps_unitWorkNodes: a worker will process this number of nodes before
> checking messages. Your understanding is right. Basically, a worker will
> process a unit of nodes, then handle messages, then process another unit
> of nodes, and so on, until termination.
>
> ALPS is decentralized, which means that the master and hubs do not have
> central pools. All the subtrees/nodes are stored locally in the workers.
> Also, ALPS works on subtrees instead of nodes.
>
> Probably, your problem is that the workload on the workers is not balanced.
> If node processing time is short, you can try root initialization. You can
> also adjust the unit work manually. There is no good manual so far, so I
> attached my thesis for you.
>
> I haven't got a chance to work on the solution limit. There are quite a few
> day-job tasks I need to handle first. Will try this weekend...
>
> Take care,
>
> Yan
>
>
>
>
> -----Original Message-----
> From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
> Sent: Wednesday, January 09, 2008 3:30 AM
> To: Yan Xu
> Subject: Questions on ALPS
>
> Hi Yan,
>
> Happy New Year! All the best for the New Year to you!
>
> Not sure if you're back to work, I have 2 questions:
> 1. When my program (ALPS based) has been running long, e.g. 12 hours,
> then most processes are in idle state, and the program seems to run
> forever. In this case I turn on the "balance" flag for both inter and
> intra cluster.
> I guess the problem is that the node pool in the hub is either full
> or has no jobs in its queue.
> In the first case I see no setting for the "node pool limit"; is there
> any option like that?
> In the second case, I guess I should play with the parameters
> Alps_donorThreshold, Alps_receiverThreshold, and Alps_unitWorkNodes. For
> Alps_unitWorkNodes, is this the number of nodes that the worker
> processes in a "loop"? For the two new parameters Alps_needWorkThreshold
> and Alps_changeWorkThreshold, what is their usage?
>
> 2. Please let me know when you finish updating for the "solution limit"
> parameter.
>
> Thanks,
> Khoa
>
> -----Original Message-----
> From: Yan Xu [mailto:Yan.Xu at sas.com]
> Sent: Tuesday, December 04, 2007 10:45 PM
> To: Khoa Vo
> Subject: RE: [Fwd: [Cops] How to stop ALPS searching as soon as a
> solutionfound]
>
> Khoa,
>
> Hopefully this weekend. I'm a bit busy these days. If I can't push
> anything this weekend, it will probably be next January, because I will
> be out on vacation for three weeks.
>
> Not sure forceTerminate_ works. Feel free to try. But the way I
> mentioned before should work, although it is cumbersome.
>
> best,
> Yan
>
> -----Original Message-----
> From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
> Sent: Tuesday, December 04, 2007 4:36 PM
> To: Yan Xu
> Subject: RE: [Fwd: [Cops] How to stop ALPS searching as soon as a
> solutionfound]
>
> Hi Yan,
>
> Thanks for the quick response.
> Do you have a rough estimate of when the new enhancement will come?
>
> I also notice a protected variable forceTerminate_ in the class
> AlpsKnowledgeBrokerMPI, may I use it for forcing the termination?
>
> Khoa
>
> -----Original Message-----
> From: Yan Xu [mailto:Yan.Xu at sas.com]
> Sent: Tuesday, December 04, 2007 7:24 PM
> To: Khoa Vo
> Cc: cops at list.coin-or.org
> Subject: RE: [Fwd: [Cops] How to stop ALPS searching as soon as a
> solutionfound]
>
> I guess it should be
>
> if (getKnowledgeBroker()->getIncumbentValue() < ALPS_INC_MAX) {
>     // Found a solution, fathom this node immediately.
>     setStatus(AlpsNodeStatusFathomed);
>     return false;       // no solution found at this node
> }
>
> -----Original Message-----
> From: cops-bounces at list.coin-or.org
> [mailto:cops-bounces at list.coin-or.org] On Behalf Of Yan Xu
> Sent: Tuesday, December 04, 2007 1:12 PM
> To: Khoa Vo
> Cc: cops at list.coin-or.org
> Subject: RE: [Fwd: [Cops] How to stop ALPS searching as soon as a
> solution found]
>
> Oh, I see what you meant. Unfortunately, you cannot trigger limits in
> process(). I guess what you can do in the process() function is to add
>
> if (getKnowledgeBroker()->getIncumbentValue() < ALPS_INC_MAX) {
>     // Found a solution, fathom this node immediately.
>     setStatus(AlpsNodeStatusFathomed);
> }
>
> at the beginning of your process() function, so it won't do any further
> processing on this node.
>
> Hope it works and saves some running time.
>
> Yan
>
>
>


