GotoBLAS experiences
GotoBLAS is a high-performance, threaded linear algebra library commonly used in high performance computing. We (the HPC staff at UA Fayetteville) use this library mainly for HPL benchmarking, so that we can get a place on the Top 500 list. I've recently been using the library for HPL benchmarking too, but for the purpose of measuring the impact of virtual machine use on application performance. Since HPL is a "standard" benchmarking code, it makes sense to use it in testing. I'm currently testing on three Intel architectures: Nocona, Harpertown, and Gainestown (i.e., Nehalem). On both the Nocona and Harpertown, GotoBLAS works very well (easy to set up, compile, link against, etc.). On the Nehalem, I ran into a big problem.
For some strange reason, GotoBLAS would not use more than 4 cores (on an 8-core box, hyper-threading disabled) no matter how I compiled GotoBLAS or changed the environment variables.
I tried OMP_NUM_THREADS=8 and GOTO_NUM_THREADS=8 (the supported ways to set the thread count at runtime).
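For anyone who wants to script that runtime test, here is a minimal sketch of setting both variables for a single run. The helper names (threaded_env, run_with_threads) are made up for illustration and the command you pass in (for example, the path to your HPL binary) is up to you; only the two environment variable names come from the text above.

```python
import os
import subprocess


def threaded_env(num_threads, base_env=None):
    """Return a copy of the environment with both thread-count variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Both variables are set to the same value, as in the test described above.
    env["OMP_NUM_THREADS"] = str(num_threads)
    env["GOTO_NUM_THREADS"] = str(num_threads)
    return env


def run_with_threads(command, num_threads):
    """Run `command` (a list of arguments) with the thread variables set.

    Hypothetical helper: `command` would be something like ["./xhpl"].
    """
    return subprocess.run(command, env=threaded_env(num_threads), check=True)
```

Building the environment per run, rather than exporting the variables in the shell, makes it harder to leave a stale value around between tests.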
In the Makefile.rule file, I changed NUM_THREADS=8 to NUM_THREADS=16, hoping that it was only using half of the max threads (don't ask why I thought that; it was simply something easy to change and test).
When none of that worked, I decided to take a slightly deeper look.
After perusing some of the initialization code, I saw that enabling OpenMP forces GotoBLAS to handle things differently.
So, that bit of knowledge in hand, I thought "let's change Makefile.rule and set USE_OPENMP=1".
After doing that and recompiling (both GotoBLAS and HPL), setting GOTO_NUM_THREADS=8 at runtime made HPL use 8 threads.
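If you'd rather script the Makefile.rule change than edit it by hand, something like the sketch below could work. It assumes the file uses plain "NAME = value" lines that may be commented out with a leading "#", which I have not verified for every GotoBLAS release; set_make_variable is a made-up helper name, not part of GotoBLAS.

```python
import re


def set_make_variable(text, name, value):
    """Set `name = value` in makefile text.

    Assumes simple `NAME = value` lines, optionally commented out with `#`.
    The first matching line is replaced (and uncommented); if no line
    matches, the assignment is appended at the end.
    """
    pattern = re.compile(
        r"^[ \t]*#?[ \t]*" + re.escape(name) + r"[ \t]*=.*$",
        re.MULTILINE,
    )
    new_line = f"{name} = {value}"
    if pattern.search(text):
        # A function replacement avoids backslash processing in new_line.
        return pattern.sub(lambda match: new_line, text, count=1)
    if text and not text.endswith("\n"):
        text += "\n"
    return text + new_line + "\n"


# Hypothetical usage against a Makefile.rule in the current directory:
# with open("Makefile.rule") as f:
#     updated = set_make_variable(f.read(), "USE_OPENMP", 1)
# with open("Makefile.rule", "w") as f:
#     f.write(updated)
```

Remember that both GotoBLAS and HPL need to be rebuilt after the change.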
SUCCESS!
With the issue resolved, I'm not going to dig any further into why that happens, since I have a fast-approaching deadline.