I think the reason Ron's compiled code is slower than the single-threaded binary is related to -DSETNUMTHREADS. You absolutely need to use this switch if you're linking with MKL. Otherwise, you'll end up with n^2 threads running, where n is the number of processor cores, and this can cause massive performance problems.
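
To make the n^2 figure concrete, here is a rough back-of-the-envelope sketch. It assumes the program starts one worker per core and that MKL, left to its defaults, also starts one thread per core for each worker that calls into it. The function name and the 8-core figure are just for illustration, and I'm assuming (not certain) that the switch works by limiting MKL to one thread per worker.

```python
def total_threads(app_workers: int, mkl_threads_per_worker: int) -> int:
    """Threads competing for the CPU when every worker calls into MKL."""
    return app_workers * mkl_threads_per_worker


cores = 8  # hypothetical machine

# Assumed default: each of the n workers lets MKL start n threads of its own.
without_switch = total_threads(cores, cores)

# Assumed effect of the switch: MKL is held to one thread per worker.
with_switch = total_threads(cores, 1)

print(without_switch)  # 64 threads fighting over 8 cores
print(with_switch)     # 8 threads, one per core
```

With 64 runnable threads on 8 cores, the scheduler spends much of its time context switching instead of doing useful work, which would be consistent with the slowdown Ron is seeing.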