[Cops] Questions on ALPS

Mon Jan 28 09:36:42 EST 2008

Khoa,

I hadn't noticed that idling was such a big issue. I guess "case #2" is the likely reason. Can you set

Alps_staticBalanceScheme  0
Alps_masterInitNodeNum 10000
Alps_hubInitNodeNum 20000
Alps_unitWorkNodes 300

or play around with these parameters to see if it helps.

If you use only one hub, then the hub is also the master.

Best,
Yan

-----Original Message-----
From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
Sent: Monday, January 28, 2008 9:20 AM
To: Yan Xu
Subject: FW: Questions on ALPS

Hi Yan,

I'm not sure whether you saw this issue, so I am sending this email again.
With your expertise, can you tell why most of the processes are sleeping
after running for about 6-7 hours? I left all parameters at their defaults,
except that Alps_staticBalanceScheme is set to 0.

Thanks,
Khoa

-----Original Message-----
From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
Sent: Tuesday, January 22, 2008 6:27 PM
To: 'Yan Xu'
Cc: 'cops at list.coin-or.org'
Subject: RE: Questions on ALPS

Thanks Yan.
So what is the reason that most processes are sleeping after running for 5-7
hours, while the tasks are not finished (that is, the search space is not yet
empty)? I notice that each process's memory usage is about 0.4%, almost the
same as at the beginning, when the process was using 99% CPU.

There are two possibilities:
1. A worker process gets stuck in one node (the sub search tree cannot
return a yes/no answer). However, in this case the CPU percentage of that
process should be 99%, so I don't think this is the case.
2. The worker processes don't have enough nodes to work on, so they are
waiting for the hub and master. In my case there is only one hub process; is
it the same as the master process?

I think case #2 makes more sense. As the static balance scheme is root
initialization, does it mean that the master doesn't process and
allocate enough tasks? Or is this due to the dynamic balance scheme,
which may be affected by hubReportPeriod, masterBalancePeriod or other
parameters? The two parameters receiverThreshold and donorThreshold don't
seem to play any role here, because most worker processes are idle.

What do you think?
Khoa

-----Original Message-----
From: Yan Xu [mailto:Yan.Xu at sas.com]
Sent: Tuesday, January 22, 2008 5:38 PM
To: Khoa Vo
Cc: cops at list.coin-or.org
Subject: RE: Questions on ALPS

Khoa,

Feel free to post emails to the mailing list.

I looked at the logs. It looks to me like the node processing time is
relatively short (130670901 nodes were processed in 6760.53 CPU sec.).

You can probably try setting the parameters

Alps_masterInitNodeNum 10000
Alps_hubInitNodeNum 20000
Alps_unitWorkNodes 300

or just play around with these parameters to see if it helps.
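
As a rough check of the figures quoted above, 130670901 nodes in
6760.53 CPU sec. works out to roughly 52 microseconds per node, so a unit of
300 nodes is on the order of 16 milliseconds of work between two rounds of
message handling. This is a sketch only: the function names are made up for
illustration and are not part of ALPS, and it assumes the CPU time and the
node count in the log cover the same set of processes.

```cpp
#include <cassert>

// Hypothetical helper, not part of ALPS: average CPU seconds spent per
// node, computed from the totals reported in a log.
double secondsPerNode(double cpuSeconds, double nodesProcessed) {
    assert(nodesProcessed > 0.0);
    return cpuSeconds / nodesProcessed;
}

// Hypothetical helper, not part of ALPS: approximate CPU time a worker
// spends on one unit of work, i.e. the interval between two rounds of
// message handling when Alps_unitWorkNodes is set to unitWorkNodes.
double secondsPerUnit(double cpuSeconds, double nodesProcessed,
                      int unitWorkNodes) {
    return secondsPerNode(cpuSeconds, nodesProcessed) * unitWorkNodes;
}
```

For example, secondsPerUnit(6760.53, 130670901.0, 300) evaluates to roughly
0.0155 seconds.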

For the instance of bandwidth 18, it doesn't look too bad.

Yan

-----Original Message-----
From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
Sent: Tuesday, January 22, 2008 6:24 AM
To: Yan Xu
Subject: RE: Questions on ALPS

If you cannot receive the compressed file, here are the plain files.
Khoa

-----Original Message-----
From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
Sent: Tuesday, January 22, 2008 12:21 PM
To: 'Yan Xu'
Subject: RE: Questions on ALPS

Hi Yan,

I tried the automatic parameters as you advised. The result looks
better.
Attached is a RAR archive with the parameter file and two log files from
runs on the same test instance. One run finds no solution for bandwidth 18,
and one quickly finds a solution with bandwidth 20. With the same parameters,
searching for bandwidth 20 seems to take forever (which is expected, as the
search space is exponentially larger than that of bandwidth 18).
What is worth noting here is that after about 7 hours, most of the processes
are in an idle state and run at about 19% CPU. I tried playing around
with the parameters but nothing improved. Do you know why, and how I can fix
that?

Another question: should I copy cops at list.coin-or.org directly on the
email, or let you do it by posting to the list?

Thanks,
Khoa

-----Original Message-----
From: Yan Xu [mailto:Yan.Xu at sas.com]
Sent: Wednesday, January 16, 2008 6:21 PM
To: cops at list.coin-or.org; Khoa Vo
Subject: RE: Questions on ALPS

Khoa,

"-Alps_masterBalancePeriod 60" specifies that the master balances the
workload every 60 nodes (processed); "-Alps_hubReportPeriod 45" specifies
that the hubs report the search state (work quality/quantity, numbers of
messages sent or received, etc.) every 45 nodes (processed).

"-Alps_masterInitNodeNum 200000000" says the master will generate
200000000 nodes for the hubs during rampup; and "-Alps_hubInitNodeNum
40000000" says a hub will generate 40000000 nodes for its workers during
rampup.

Note that Alps_masterInitNodeNum and Alps_hubInitNodeNum are only valid when
Alps_staticBalanceScheme is set to 'root initialization' (the default is
spiral; see AlpsParams.h).

200000000 and 40000000 are way too large. Basically, this means the search
will get stuck in rampup and the master will do all the search.

"Alps_unitWorkNodes 10000000" specifies that a worker will process
10000000 nodes and then check/handle messages, then process 10000000
nodes, and so on.

10000000 is too big to be a unit of work. Dynamic load balancing
has little chance to kick in.
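
To illustrate why the unit size matters, here is a toy model of the worker
loop described above. The names are hypothetical and this is not ALPS source
code; it only counts how often a worker gets to handle messages while working
through a given number of nodes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy model, not ALPS code: outcome of one worker run.
struct WorkerRun {
    std::int64_t nodesProcessed = 0;  // nodes handled in total
    std::int64_t messageChecks = 0;   // rounds of message handling
};

// Process nodesAvailable nodes in batches of at most unitWorkNodes,
// handling messages once after every batch.
WorkerRun simulateWorker(std::int64_t nodesAvailable,
                         std::int64_t unitWorkNodes) {
    assert(unitWorkNodes > 0);
    WorkerRun run;
    while (nodesAvailable > 0) {
        const std::int64_t batch = std::min(nodesAvailable, unitWorkNodes);
        nodesAvailable -= batch;       // "process" one unit of work
        run.nodesProcessed += batch;
        ++run.messageChecks;           // then check/handle messages
    }
    return run;
}
```

With 10000000 nodes of local work, a unit of 300 gives 33334 rounds of
message handling, while a unit of 10000000 gives just one, at the very end.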

I suggest you first set Alps_staticBalanceScheme to 0 (root
initialization), leave all other parameters untouched, and see the results.

Is it possible to share the logs?

Yan

-----Original Message-----
From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
Sent: Tuesday, January 15, 2008 12:36 PM
To: Yan Xu
Subject: RE: Questions on ALPS

Hi Yan,

Thanks, now I understand why the program didn't stop. I will set these
initNodeNum parameters smaller. However, I need to test how to avoid
"sleeping" states when running for more than a few hours.
At the moment I run my program on a cluster with five 4-core machines.
Sorry, I didn't compute the average time for processing a node,
as I tested many instances. I am switching to the bigger cluster of
the university, which is expected to have up to 128 nodes.

Is there any rule or hint for setting these parameters? Changing them
from the defaults to the following values did improve my program (no sleeping
processes in long runs).
 -Alps_masterBalancePeriod 60
 -Alps_hubReportPeriod 45
 -Alps_masterInitNodeNum 200000000
 -Alps_hubInitNodeNum 40000000
 -Alps_unitWorkNodes 10000000

Khoa

-----Original Message-----
From: Yan Xu [mailto:Yan.Xu at sas.com]
Sent: Tuesday, January 15, 2008 3:06 PM
To: Khoa Vo
Subject: RE: Questions on ALPS

Khoa,

If using root initialization and setting Alps_masterInitNodeNum to
200000000, then the master will generate 200000000 nodes for other
processors during rampup. There is a good chance that the master cannot get
past the ramp-up phase. ALPS does not check whether the time limit has been
reached during rampup.
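
The effect can be pictured with a toy model. This is not ALPS code: the
function name is hypothetical, and the time limit is expressed as a node
budget to keep the arithmetic simple. The limit is only consulted once
ramp-up is over, so a huge ramp-up node count overrides it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy model, not ALPS code: number of nodes handled before the search stops.
// rampUpNodes : nodes generated during ramp-up (no limit check in this phase)
// limitNodes  : stand-in for the time limit, expressed as a node budget
// totalNodes  : size of the whole search tree
std::int64_t nodesBeforeStop(std::int64_t rampUpNodes,
                             std::int64_t limitNodes,
                             std::int64_t totalNodes) {
    assert(rampUpNodes >= 0 && limitNodes >= 0 && totalNodes >= 0);
    // Phase 1: ramp-up runs to completion (or until the tree is exhausted).
    std::int64_t processed = std::min(rampUpNodes, totalNodes);
    // Phase 2: regular search, which stops once the budget is used up.
    if (processed < limitNodes) {
        processed = std::min(limitNodes, totalNodes);
    }
    return processed;
}
```

In this model, a ramp-up of 200000000 nodes with a budget of 1000000 still
handles 200000000 nodes, whereas a ramp-up of 10000 nodes stops at the
budget.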

Also the unit work seems very large.

Probably you can make

-Alps_masterInitNodeNum 200000000
-Alps_hubInitNodeNum 40000000
-Alps_unitWorkNodes 10000000

much smaller, or let ALPS choose the values for you (comment out the
parameters).

How many processors are you using? How long does it take to process a
node?

Yan

-----Original Message-----
From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
Sent: Tuesday, January 15, 2008 6:08 AM
To: Yan Xu
Subject: RE: Questions on ALPS

Hi Yan,

Sounds great! Thanks for your fix. I will try it and let you know the
result.

I have another question regarding the time limit. I tried changing
(increasing) these parameters to avoid "sleeping states" when the program
runs for a long time:
 -Alps_masterBalancePeriod 60
 -Alps_hubReportPeriod 45
 -Alps_masterInitNodeNum 200000000
 -Alps_hubInitNodeNum 40000000
 -Alps_unitWorkNodes 10000000

After that, the time limit set with Alps_timeLimit doesn't work anymore. The
program continues running after the time limit is exceeded.
Is this by design, or did I do something wrong?

Thanks,
Khoa

-----Original Message-----
From: Yan Xu [mailto:Yan.Xu at sas.com]
Sent: Monday, January 14, 2008 9:46 PM
To: Khoa Vo
Subject: RE: Questions on ALPS

Khoa,

I think I fixed the solution limit issue. You can use Alps_solLimit to
stop the search. For simplicity, the solution limit is not checked during
rampup. Hopefully, it works fine. The change was committed to trunk;
you can get trunk with

svn checkout https://projects.coin-or.org/svn/CHiPPS/Alps/trunk Alps-trunk

regards,
Yan

-----Original Message-----
From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
Sent: Thursday, January 10, 2008 5:07 AM
To: Yan Xu
Subject: RE: Questions on ALPS

Hi Yan,

Thanks, I am playing around with the parameters.
Let me know when you have the "solution limit" available.

Best,
Khoa

Yan Xu <Yan.Xu at sas.com> said:

> Khoa,
>
> Wish you a happy new year too!
>
>
> https://projects.coin-or.org/CHiPPS/browser/Alps/trunk/Alps/src/AlpsParams.h
>
> has some explanation of the ALPS parameters.
>
> Alps_needWorkThreshold: a worker will ask for subtrees/nodes if its
> workload is below the value specified by this parameter.
> Alps_changeWorkThreshold: a worker will quit working on the current
> subtree if its quality is worse than the best one in the subtree pool by
> a certain amount.
> Alps_unitWorkNodes: a worker will process this number of nodes before
> checking messages. Your understanding is right. Basically, a worker will
> process a unit of nodes, then handle messages, and then process a unit
> of nodes, and so on, until it terminates.
>
> ALPS is decentralized, which means that the master and hubs do not have
> central pools. All the subtrees/nodes are stored locally in the workers.
> Also, ALPS works on subtrees instead of nodes.
>
> Probably, your problem is due to the workload on the workers not being
> balanced. If node processing time is short, you can try
> root-initialization. You can also adjust the unit work manually. There is
> no good manual so far, so I attached my thesis for you.
>
> I haven't got a chance to work on the solution limit. There are quite a
> few day-job tasks I need to handle first. Will try this weekend...
>
> Take care,
>
> Yan
>
>
>
>
> -----Original Message-----
> From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
> Sent: Wednesday, January 09, 2008 3:30 AM
> To: Yan Xu
> Subject: Questions on ALPS
>
> Hi Yan,
>
> Happy New Year! All the best for the New Year to you!
>
> Not sure if you're back at work. I have 2 questions:
> 1. When my program (ALPS based) has been running for a long time, e.g. 12
> hours, most processes are in an idle state, and the program seems to run
> forever. In this case I turned on the "balance" flag for both inter and
> intra cluster.
> I guess the problem is that the node pool in the hub is either full
> or has no job in the queue.
> For the first case, I see no setting for a "node pool limit"; is there
> any option like that?
> For the second case, I guess I should play with the parameters
> Alps_donorThreshold, Alps_receiverThreshold, and Alps_unitWorkNodes. For
> Alps_unitWorkNodes, is this the number of nodes that the worker
> processes in a "loop"? For the two new parameters Alps_needWorkThreshold
> and Alps_changeWorkThreshold, what is their usage?
>
> 2. Please let me know when you finish updating the "solution limit"
> parameter.
>
> Thanks,
> Khoa
>
> -----Original Message-----
> From: Yan Xu [mailto:Yan.Xu at sas.com]
> Sent: Tuesday, December 04, 2007 10:45 PM
> To: Khoa Vo
> Subject: RE: [Fwd: [Cops] How to stop ALPS searching as soon as a
> solution found]
>
> Khoa,
>
> Hopefully this weekend. I'm a bit busy these days. If I can't push
> anything this weekend, it will probably be next January, because I will
> be out on vacation for three weeks.
>
> Not sure forceTerminate_ works. Feel free to try. But the way I
> mentioned before should work, although it is cumbersome.
>
> best,
> Yan
>
> -----Original Message-----
> From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
> Sent: Tuesday, December 04, 2007 4:36 PM
> To: Yan Xu
> Subject: RE: [Fwd: [Cops] How to stop ALPS searching as soon as a
> solution found]
>
> Hi Yan,
>
> Thanks for the quick response.
> Do you have a rough estimate of when the new enhancement will come?
>
> I also noticed a protected variable forceTerminate_ in the class
> AlpsKnowledgeBrokerMPI. May I use it to force termination?
>
> Khoa
>
> -----Original Message-----
> From: Yan Xu [mailto:Yan.Xu at sas.com]
> Sent: Tuesday, December 04, 2007 7:24 PM
> To: Khoa Vo
> Cc: cops at list.coin-or.org
> Subject: RE: [Fwd: [Cops] How to stop ALPS searching as soon as a
> solution found]
>
> I guess it should be
>
> if (getKnowledgeBroker()->getIncumbentValue() < ALPS_INC_MAX) {
>     // Found a solution, fathom this node immediately.
>     setStatus(AlpsNodeStatusFathomed);
>     return false;       // no solution found at this node
> }
>
> -----Original Message-----
> From: cops-bounces at list.coin-or.org
> [mailto:cops-bounces at list.coin-or.org] On Behalf Of Yan Xu
> Sent: Tuesday, December 04, 2007 1:12 PM
> To: Khoa Vo
> Cc: cops at list.coin-or.org
> Subject: RE: [Fwd: [Cops] How to stop ALPS searching as soon as a
> solution found]
>
> Oh, I see what you meant. Unfortunately, you cannot trigger limits in
> process(). I guess what you can do in the process() function is to add
>
> if (getKnowledgeBroker()->getIncumbentValue() < ALPS_INC_MAX) {
>     // Found a solution, fathom this node immediately.
>     setStatus(AlpsNodeStatusFathomed);
> }
>
> at the beginning of your process() function, so it won't do further
> processing on this node.
>
> Hope it works and saves some running time.
>
> Yan
>
>
>

--