[CppAD] Performance of CppAD and OpenMP

schattenpflanze at arcor.de schattenpflanze at arcor.de
Sat Feb 19 10:27:31 EST 2011


Hello Brad,

>   The test case
> http://www.coin-or.org/CppAD/Doc/example_a11c.cpp.xml
> is an example from the OpenMP standards document. It does not use CppAD
> at all and is intended to test the limitations of your system and
> compiler.  It is one of the cases run by openmp/run.sh. I have found
> that, for some systems and compilers, it has the type of performance
> that you describe below.
Thank you for pointing this out. I checked the source code, and I agree 
that the poor scaling of example_a11c.cpp is unrelated to my problem. In 
that example, an "inner loop" is parallelized, and the parallel region 
is entered about 1e5 times with only a very small amount of work per 
loop index. The overhead of creating and starting the threads exceeds 
the payload itself, and entering the region so often multiplies that 
overhead. In addition, the modified data is shared, so cache line 
invalidations may occur.

The minimal example I listed in my previous mail, however, is of a 
different nature. The outermost loop is parallelized, so the threads 
are created only once. The work per (outer) loop index is large enough 
for parallelization to pay off. Moreover, the threads do not modify 
shared data that could cause cache misses; in fact, the work done for 
each loop index is independent of every other index. Without the call 
to CppAD::Independent(x), the scaling is good, although the workload 
and runtime are smaller.

Best regards,
Peter



>
> On 2/18/2011 12:30 PM, schattenpflanze at arcor.de wrote:
>> Hello,
>>
>> I have another question concerning the performance of CppAD when
>> OpenMP is enabled. It seems that CppAD scales very badly when the
>> number of threads and cores exceeds a certain number. I have tried to
>> construct a minimal example reproducing the issue. Running the simple
>> (and absolutely pointless) example code listed below on a machine with
>> 32 native cores (no hyperthreading, single workstation) yields the
>> following results:
>> 1 thread:   8.6 s
>> 4 threads:  2.8 s
>> 8 threads:  2.2 s
>> 10 threads: 2.4 s
>> 12 threads: 4.0 s
>> 14 threads: 3.8 s
>> 16 threads: 4.2 s
>> 24 threads: 8.1 s (!)
>> 28 threads: 9.5 s (!)
>>
>> I am, of course, aware that additional threads cause additional
>> overhead, and that the performance does not necessarily increase with
>> the number of threads. However, this significant _decrease_ seems
>> strange. In particular, if I remove the line
>> CppAD::Independent(x)
>> from the code, I obtain:
>> 4 threads: 0.38 s
>> 8 threads: 0.20 s
>> 16 threads: 0.14 s
>> 24 threads: 0.12 s,
>> which is the kind of scaling that I would have expected.
>>
>> Memory consumption seems to be low. I have tried various scheduling
>> and variable sharing policies, but the problem persists. I also attach
>> the interesting results of the CppAD openmp test script. What is the
>> reason for this behaviour and how can I counter it?
>>
>> Thank you and best regards,
>> Peter
>>
>>
>> Test code:
>> ----------------------------------------------------
>> int n_par = 45;
>> CppAD::vector<AD<double> > x(n_par);
>> for (int i=0; i<n_par; ++i) {
>>   x[i] = i;
>> }
>>
>> #pragma omp parallel for \
>>     firstprivate(x) \
>>     schedule(dynamic,1) \
>>     num_threads(global_paras->n_threads)
>> for (int i=0; i<1000; ++i) {
>>   CppAD::Independent(x);
>>   CppAD::vector<AD<double> > y(1);
>>
>>   y[0] = 0.0;
>>   for (int k=0; k<1000; ++k) {  // renamed to avoid shadowing outer i
>>     for (int j=0; j<(int)x.size(); ++j) {
>>       y[0] += CppAD::pow(x[j] - y[0], 0.1);
>>     }
>>   }
>> }
>>
>>
>> _______________________________________________
>> CppAD mailing list
>> CppAD at list.coin-or.org
>> http://list.coin-or.org/mailman/listinfo/cppad


