For a few years now, I have been trying to understand OpenMP performance, especially when working with large arrays.
Two frequent problems with testing OpenMP implementations are:
- Insufficient workload for the OpenMP DO loop. A trivial calculation in the loop is not going to overcome the overhead of initiating the !$OMP region. It takes about 5 microseconds to initiate a region, which is about 20,000 processor cycles (at roughly 4 GHz) and looks huge to me. There is also a slight extra overhead for SCHEDULE(DYNAMIC) compared with SCHEDULE(STATIC). DYNAMIC can be preferred where the workload per thread is variable. Its overhead is usually a minor issue, but it does highlight the problem of balancing the workload between threads.
- Increased thread counts can involve increased memory demand. When the combined memory demand of all threads exceeds the cache size, memory traffic can quickly saturate the memory bandwidth, stalling the gains from extra threads. This appears to be a black art that I have yet to master. A simple OpenMP example is dot_product. It looks good, but once the arrays are made large enough to overcome the startup delay, it will always fail on memory bandwidth. There might be a sweet spot for array size, but my real problems never have that characteristic.
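Both effects can be measured with a small test program. The following is only a minimal sketch, assuming gfortran with -fopenmp; the program name omp_dot, the array size n and the repeat count nrep are illustrative assumptions, not values from any real problem. It times a near-empty parallel region to estimate the start-up cost, then compares the intrinsic dot_product with an OpenMP reduction loop.

```fortran
! Sketch only: estimate region start-up cost and compare a serial
! dot_product with an OpenMP reduction.
! Build (assumed): gfortran -O2 -fopenmp omp_dot.f90
program omp_dot
   use omp_lib
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, parameter :: n = 10000000     ! assumed array size; vary it to look for a sweet spot
   integer, parameter :: nrep = 1000      ! assumed repeat count for the region timing
   real(dp), allocatable :: a(:), b(:)
   real(dp) :: s_serial, s_omp
   real(dp) :: t0, t1, t_region, t_serial, t_omp
   integer :: i, k, id

   allocate (a(n), b(n))
   a = 1.0_dp
   b = 2.0_dp

   ! Average cost of entering and leaving a region that does almost no work
   t0 = omp_get_wtime()
   do k = 1, nrep
      !$omp parallel private(id)
      id = omp_get_thread_num()
      !$omp end parallel
   end do
   t1 = omp_get_wtime()
   t_region = (t1 - t0) / nrep

   ! Single-thread reference using the intrinsic
   t0 = omp_get_wtime()
   s_serial = dot_product(a, b)
   t1 = omp_get_wtime()
   t_serial = t1 - t0

   ! OpenMP version: each thread sums its own part, then the parts are combined
   s_omp = 0.0_dp
   t0 = omp_get_wtime()
   !$omp parallel do reduction(+:s_omp) schedule(static)
   do i = 1, n
      s_omp = s_omp + a(i) * b(i)
   end do
   !$omp end parallel do
   t1 = omp_get_wtime()
   t_omp = t1 - t0

   print '(a,i0)',      ' max threads          : ', omp_get_max_threads()
   print '(a,es12.4)',  ' region start-up (s)  : ', t_region
   print '(a,es12.4)',  ' serial dot_product(s): ', t_serial
   print '(a,es12.4)',  ' OpenMP dot_product(s): ', t_omp
   print '(a,2es16.8)', ' results              : ', s_serial, s_omp
end program omp_dot
```

Running this for a range of n (and with OMP_NUM_THREADS set to different values) should show where the loop time drops below the region start-up cost, and where adding threads stops helping because the arrays no longer fit in cache. Changing schedule(static) to schedule(dynamic) gives a rough view of the extra scheduling overhead.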
Minor speed differences between gFortran and iFort may come down to optimisation strategies, especially how IF statements are handled, or possibly how data is positioned to make use of the L1 cache.
The more important question should be whether OpenMP provides a significant improvement over the single-thread case. Hopefully this is a more significant gain than the difference between gFortran and iFort. Where OpenMP is not providing a gain, that is the more challenging problem.
As I am using gFortran for OpenMP, it is good to know that the two compilers share the better performance, each being faster for different calculations.