Fortran applications using Fortran 2008+ features

Third, the literature contains great examples of coarrays running in important applications at scale. Here are three, dating back as far as a decade:

Speedup of 33% relative to MPI-2 on 80K cores for a European weather model:

Mozdzynski, G., Hamrud, M., & Wedi, N. (2015). A Partitioned Global Address Space implementation of the European Centre for Medium Range Weather Forecasts Integrated Forecasting System. International Journal of High Performance Computing Applications, 1094342015576773.

Performance competitive with MPI-3 for several applications:

Garain, S., Balsara, D. S., & Reid, J. (2015). Comparing Coarray Fortran (CAF) with MPI for several structured mesh PDE applications. Journal of Computational Physics.

Speedup of 50% relative to MPI-2 for a plasma fusion code on 130,000 cores:

Preissl, R., Wichmann, N., Long, B., Shalf, J., Ethier, S., & Koniges, A. (2011, November). Multithreaded global address space communication techniques for gyrokinetic fusion applications on ultra-scale platforms. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (p. 78). ACM.
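For readers who haven't used coarrays, the communication style behind results like these is quite compact. The toy halo exchange below is my own illustration of the pattern a 1-D structured-mesh solver typically needs, not code from any of the cited papers:

```fortran
! Toy halo exchange (my own sketch, not from any of the cited codes)
! showing the coarray communication pattern of a 1-D structured-mesh solver.
program halo_exchange
  implicit none
  integer, parameter :: n = 1000      ! local cells per image
  real :: u(0:n+1)[*]                 ! local field plus one-cell halos
  integer :: me, np

  me = this_image()
  np = num_images()
  u  = real(me)                       ! dummy data: each image fills its own cells

  sync all                            ! neighbours must finish writing before we read
  if (me > 1)  u(0)   = u(n)[me-1]    ! pull last interior cell of the left neighbour
  if (me < np) u(n+1) = u(1)[me+1]    ! pull first interior cell of the right neighbour
  sync all

  if (me == 1) print *, 'halo exchange complete on', np, 'images'
end program halo_exchange
```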

And nearly every project I’ve worked on for the past ~5 years has involved parallel Fortran 2018, though not always coarrays specifically. Not all of the projects are open-source, and some are dormant, but a few of the open-source codes are:

Sadly, none of these are really the best examples: the first is a proxy application, the second is very early in its development, the third hasn’t yet merged its parallel branch into the default branch, and the fourth is dormant for funding reasons. Nonetheless, the first one played a central role in a Ph.D. dissertation submitted just last month, and I was just awarded funding to continue work on it, so at least the development will continue.

In summary, I don’t think the situation is quite as stark as most imagine, but it’s nowhere near what it could be. Compiler support is considerably broader than most people realize. There have been several published successes with parallel Fortran 2018 running in important applications at scale.

Finally, the best news I’ve seen on do concurrent is that NVIDIA supports offloading do concurrent to GPUs. Researcher Jeff Hammond, who recently moved from Intel to NVIDIA, has privately shared great results from testing this capability with his Parallel Research Kernels. He reported bandwidth comparable to CUDA.
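For context, the loops that benefit are ordinary do concurrent constructs. The sketch below is my own minimal illustration (not from the Parallel Research Kernels), and the exact offload flags vary by NVIDIA HPC SDK version:

```fortran
! Minimal sketch (my own, not from the Parallel Research Kernels) of the
! kind of loop that qualifies for GPU offload; on recent NVIDIA HPC SDK
! releases something like "nvfortran -stdpar=gpu" enables it (check the
! documentation for your version).
program daxpy_dc
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
  integer, parameter :: n = 10**7
  integer :: i
  real(dp), allocatable :: x(:), y(:)
  real(dp) :: a

  allocate(x(n), y(n))
  a = 2.0_dp
  x = 1.0_dp
  y = 3.0_dp

  ! No pointers, no indirect addressing: every iteration is independent,
  ! which is exactly what do concurrent asserts to the compiler.
  do concurrent (i = 1:n)
     y(i) = a*x(i) + y(i)
  end do

  print *, 'y(1) =', y(1)
end program daxpy_dc
```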

It’s time for users to be more vocal and let the lagging vendors know they want better support for do concurrent. I strongly disagree with the claim that do concurrent has “profound design flaws.” Every problematic case I’ve seen involves pointers or indirect addressing, and I rarely use either feature. While they are common in some applications, especially unstructured-grid codes, many useful production applications, especially structured-grid codes, never need them. If do concurrent were profoundly flawed, it would be hard to understand why NVIDIA would work on offloading it, and hard to explain the results Jeff Hammond reported to me privately. I hope we see some publications of such results soon.
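To make the distinction concrete, here is a toy contrast of my own, not taken from any real application: a structured-grid update that do concurrent handles cleanly versus the indirect-addressing pattern where the questions arise.

```fortran
! My own toy contrast, not taken from any real application.
subroutine relax(u, unew, n)
  implicit none
  integer, intent(in)  :: n
  real,    intent(in)  :: u(n)
  real,    intent(out) :: unew(n)
  integer :: i

  ! Structured-grid access: each iteration writes a distinct element,
  ! so the independence that do concurrent asserts clearly holds.
  do concurrent (i = 2:n-1)
     unew(i) = 0.5*(u(i-1) + u(i+1))
  end do
  unew(1) = u(1)
  unew(n) = u(n)

  ! The problematic pattern is indirect addressing, sketched for contrast:
  !   do concurrent (i = 1:m)
  !      unew(idx(i)) = ...   ! two iterations could target the same element
  !   end do
end subroutine relax
```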

Fortran’s adherence to upward compatibility is likely one of the main reasons the language remains in use after more than 60 years. Unless the problematic cases can be addressed without breaking upward compatibility, I think it would be better to propose a replacement feature, e.g., do parallel, with syntax as close to do concurrent as possible so that existing codes can migrate to the new feature easily.
