[CHiPPS] Setting physical memory limit

Mon Feb 11 09:22:25 EST 2008

Hi Yan,

It is 256 processes, and the crash does not happen in the ramp-up phase.
Is it possible that this happens on the first hub, which also functions
as the master? The hub generates lots of nodes itself and keeps these
child nodes in its pool.

Khoa

-----Original Message-----
From: Yan Xu [mailto:Yan.Xu at sas.com] 
Sent: Monday, February 11, 2008 3:12 PM
To: Khoa Vo
Cc: cops at list.coin-or.org
Subject: RE: Setting physical memory limit

Khoa,

Does this happen during ramp-up? (ALPS prints messages to the screen, so
you can tell whether it is in the ramp-up phase.)

If it does, then you have to reduce this parameter:

Alps_masterInitNodeNum 2000 #3000

But 2000 nodes should not use 8 GB of RAM unless they are huge nodes
(about 4 MB each).

If it is not in the ramp-up phase, then the master doesn't have any nodes
in its pool, because the master in ALPS doesn't process nodes after the
ramp-up phase and all the nodes are distributed to other processors. So
it is probable that a worker has too many nodes and load balancing works
poorly.

If you are using 4 processes, increasing the number of hubs will not
help; it is better to use just 1 hub.

Best,
Yan

-----Original Message-----
From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
Sent: Saturday, February 09, 2008 5:31 AM
To: Yan Xu
Cc: cops at list.coin-or.org
Subject: RE: Setting physical memory limit

Hi Yan,

Having strong inequalities is our goal, but while waiting for them I
still have to rely on parallel power :)

Below is the ps output for the last crashed job, on the first node:
kvo      14128 96.7 93.4 8159984 7716816 ?  Dsl  10:42   6:31 ./bwp -param lns_131-19.par
kvo      14129 93.4  0.1  153504   14832 ?  Rsl  10:42   6:17 ./bwp -param lns_131-19.par
kvo      14130 97.5  3.0  324516  253312 ?  Dsl  10:42   6:33 ./bwp -param lns_131-19.par
kvo      14131 97.6  0.2  100200   19272 ?  Dsl  10:42   6:34 ./bwp -param lns_131-19.par
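
For readers who want to check the numbers, here is a minimal sketch that
parses the four lines above. It assumes the output follows the usual
`ps aux` column layout (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME
COMMAND) with RSS reported in KiB; the column header is not part of the
original message, so treat that as an assumption. The helper name
parse_ps is made up for illustration.

```python
# Assumption: `ps aux` column order, with VSZ and RSS in KiB.
PS_OUTPUT = """\
kvo 14128 96.7 93.4 8159984 7716816 ? Dsl 10:42 6:31 ./bwp -param lns_131-19.par
kvo 14129 93.4 0.1 153504 14832 ? Rsl 10:42 6:17 ./bwp -param lns_131-19.par
kvo 14130 97.5 3.0 324516 253312 ? Dsl 10:42 6:33 ./bwp -param lns_131-19.par
kvo 14131 97.6 0.2 100200 19272 ? Dsl 10:42 6:34 ./bwp -param lns_131-19.par
"""

KIB_PER_GIB = 1024 ** 2


def parse_ps(text):
    """Return one dict per process line (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        # Split into at most 11 fields so the command keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue  # skip blank or truncated lines
        rows.append({
            "pid": int(fields[1]),
            "mem_pct": float(fields[3]),
            "rss_kib": int(fields[5]),
            "command": fields[10],
        })
    return rows


rows = parse_ps(PS_OUTPUT)
largest = max(rows, key=lambda r: r["rss_kib"])
largest_gib = largest["rss_kib"] / KIB_PER_GIB
total_gib = sum(r["rss_kib"] for r in rows) / KIB_PER_GIB

print(f"largest process: pid {largest['pid']}, {largest_gib:.2f} GiB resident")
print(f"all four processes: {total_gib:.2f} GiB resident")
```

Under that assumption, a single process (the first one listed) accounts
for almost all of the resident memory on the node, while the other three
stay small.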

I think the memory overflow happens only on the master. Does that mean
the master keeps many nodes without delivering them to the hubs? How can
I "push" these unprocessed nodes evenly to the hubs? By increasing the
number of hubs?

Khoa

-----Original Message-----
From: Yan Xu [mailto:Yan.Xu at sas.com]
Sent: Friday, February 08, 2008 4:23 PM
To: Khoa Vo
Cc: cops at list.coin-or.org
Subject: RE: Setting physical memory limit

Khoa,

nodeLimit specifies how many nodes ALPS can process. When this number is
reached, ALPS terminates and returns the node-limit solution status.

The checkMemory parameter reports how much memory is used by ALPS. I
have only tested it for serial code on a Linux machine.

Since different platforms have different ways to get memory info, ALPS
doesn't check memory usage to switch strategy.
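
To illustrate the platform issue, here is a minimal sketch of a
per-process memory check written outside ALPS. It is not how checkMemory
is implemented; the function names peak_rss_bytes and over_limit are
hypothetical. It uses Python's standard resource module, which exists
only on Unix-like systems, and it has to special-case the unit of
ru_maxrss: Linux reports kilobytes, macOS reports bytes.

```python
import resource
import sys


def peak_rss_bytes():
    """Peak resident set size of the current process, in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes; macOS reports it in bytes.
    return peak if sys.platform == "darwin" else peak * 1024


def over_limit(limit_bytes):
    """True if this process has ever exceeded limit_bytes of resident memory."""
    return peak_rss_bytes() > limit_bytes


if __name__ == "__main__":
    limit = 8 * 1024 ** 3  # 8 GiB, the per-node limit mentioned in this thread
    print("peak RSS (bytes):", peak_rss_bytes())
    print("over 8 GiB limit:", over_limit(limit))
```

A search loop could call something like over_limit() before generating
children, but even this small sketch needs a platform branch, which is
the difficulty described above.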

For a large instance, I guess you can try setting Alps_searchStrategy to
depth-first. Also, try to come up with a method that tends to generate
fewer nodes (like strong branching in solving MILPs).
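
As a rough illustration of why depth-first helps with memory, the sketch
below counts the largest number of open nodes held in the pool while
enumerating a complete binary tree with no pruning. This is a toy model,
not ALPS code: the function max_pool_size is hypothetical, and
breadth-first stands in for best-first under the assumption that bounds
get worse with depth.

```python
from collections import deque


def max_pool_size(depth, strategy):
    """Largest number of open nodes while enumerating a complete binary tree.

    Nodes are represented only by their depth; every node above `depth`
    produces two children and nothing is pruned (toy assumption).
    """
    pool = deque([0])  # the root is at depth 0
    largest = len(pool)
    while pool:
        # Depth-first takes the newest node, breadth-first the oldest.
        node_depth = pool.pop() if strategy == "depth" else pool.popleft()
        if node_depth < depth:
            pool.append(node_depth + 1)
            pool.append(node_depth + 1)
        largest = max(largest, len(pool))
    return largest


if __name__ == "__main__":
    for strategy in ("depth", "breadth"):
        print(strategy, max_pool_size(10, strategy))
```

In this model the depth-first pool grows linearly with the tree depth,
while the breadth-first pool grows with the width of the deepest level.
Real runs prune nodes, so the gap will differ, but the trend is the
reason for the suggestion.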

Hope it helps.

Best,
Yan

-----Original Message-----
From: Khoa Vo [mailto:khoa.vo at informatik.uni-heidelberg.de]
Sent: Friday, February 08, 2008 9:47 AM
To: Yan Xu
Cc: cops at list.coin-or.org
Subject: Setting physical memory limit

Hi Yan,

My ALPS-based program is having a problem with memory limits. In our
cluster, each node (quad-core processor) is limited to 8 GB of RAM.
One of our test instances caused a crash because the memory it takes
on one node exceeds 8 GB.

Is there a parameter in ALPS with which I can set a memory limit, so
that if the total physical memory used by one process exceeds a certain
limit, child ALPS nodes are no longer generated? I noticed the two
following parameters, but neither takes a number for the memory.

/** The max number of nodes can be processed.
    Default: ALPS_INT_MAX */
nodeLimit,

/** Check memory.
    Default: false */
checkMemory,

Thanks,
Khoa

==================My parameter files=======================
Alps_instance ../data/Harwell-Boeing_small/lns__131.mtx.rnd

Bw_labelingDirection 0

Bw_upperBound 19

Bw_lowerBound 1

Alps_solLimit 1

#Alps_nodeLimit 5000000

Alps_logFileLevel 0

Alps_msgLevel 1
Alps_hubMsgLevel 0
Alps_workerMsgLevel 0

Alps_hubNum 8

Alps_interClusterBalance 1      # 1: balancing load, 0: don't.
Alps_intraClusterBalance 1      # 1: balancing load, 0: don't.

#Alps_searchStrategy 0 # 0: Best, 1: Best-est, 2: Breadth, 3: Depth, 4: hybrid
#Alps_searchStrategyRampUp  0

#Alps_staticBalanceScheme 0

#Alps_nodeLogInterval 1000
#Alps_hubWorkClusterSizeLimit 1

Alps_masterInitNodeNum 2000 #3000
Alps_hubInitNodeNum 300 #5000
Alps_unitWorkNodes 10

#Alps_hubWorkClusterSizeLimit 3

#Alps_needWorkThreshold 0.5
#Alps_changeWorkThreshold 0.10

#Alps_donorThreshold 0.3
#Alps_receiverThreshold 0.3

#Alps_masterBalancePeriod 60 #0.3
#Alps_hubReportPeriod 30 #0.5