Poor OpenMP scaling with ifort but not gfortran

I do not use OOP at all, but what you are describing (a much higher L1 cache replacement rate) could suggest that ifort is storing the type structure in a different way from gfortran.

My approach with derived types is to use them as a problem-definition data structure, but to use local arrays for the inner-loop calculation (a minimal sketch of this pattern is shown below).
My reasoning is that I then have a better picture of how the calculated data is stored and referenced, and so can better judge the memory demands of the inner loop. This assumes that working through OOP-style components is more expensive in memory usage.
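A minimal sketch of that pattern, assuming a made-up problem_def type with coef and state components (the names are only for illustration) and compilation with OpenMP enabled:

```fortran
module problem_mod
   implicit none
   type :: problem_def                 ! derived type holds the problem definition only
      integer :: n
      real(8), allocatable :: coef(:)
      real(8), allocatable :: state(:)
   end type problem_def
contains
   subroutine step (p)
      type(problem_def), intent(inout) :: p
      real(8), allocatable :: coef(:), state(:)   ! local plain arrays for the inner loop
      integer :: i

      coef  = p%coef                   ! copy the components once, outside the hot loop
      state = p%state

!$OMP PARALLEL DO SHARED(coef, state, p) PRIVATE(i)
      do i = 1, p%n
         state(i) = state(i)*coef(i)   ! the inner loop only touches local arrays
      end do
!$OMP END PARALLEL DO

      p%state = state                  ! copy the result back once
   end subroutine step
end module problem_mod

program demo
   use problem_mod
   implicit none
   type(problem_def) :: p
   p%n = 1000000
   allocate (p%coef(p%n), p%state(p%n))
   p%coef = 1.001d0
   p%state = 1.0d0
   call step(p)
   print *, 'state(1) =', p%state(1)
end program demo
```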

This approach helps my understanding of performance, but others may disagree: some OOP implementations report good performance, which could imply my simplified understanding is wrong.

I do think poor memory usage is a significant failing in my OpenMP implementations.

I do have an interesting time-stepping example with a shared 20-30 GB array in a multi-threaded calculation. By using !$OMP BARRIER to keep all threads at a similar stage of the repeated calculation, I achieve a greater degree of L3 cache sharing between threads, which halves the total elapsed time (from 5 hours to 3 hours). I look forward to testing this on a new processor with better DDR5 memory bandwidth to see whether it supports my understanding of the problem. A stripped-down sketch of the barrier idea is shown below.
The calculation approach requires each thread to scan the large array twice per time step, and I can't see a way to restructure the group calculation.
This is with dual-channel memory; I don't know what more channels would provide. And throw in different classes of cores, and there is always a new twist to understand!
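A stripped-down sketch of what I mean, with a made-up array size and a trivial sum standing in for the real calculation (assumes compilation with OpenMP, e.g. -fopenmp or -qopenmp):

```fortran
program sync_scan
   use omp_lib
   implicit none
   integer, parameter :: nstep = 50
   integer, parameter :: n = 20000000          ! stand-in for the 20-30 GB shared array
   real(8), allocatable :: big(:), thread_sum(:)
   real(8) :: acc
   integer :: istep, i, it

   allocate (big(n), thread_sum(omp_get_max_threads()))
   big = 1.0d0

!$OMP PARALLEL PRIVATE(istep, i, it, acc) SHARED(big, thread_sum)
   it  = omp_get_thread_num() + 1
   acc = 0.0d0
   do istep = 1, nstep
      ! first scan: every thread reads the whole shared array
      do i = 1, n
         acc = acc + big(i)
      end do
!$OMP BARRIER      ! hold the faster threads so all start the second scan together
      ! second scan of the same array within the same time step
      do i = 1, n
         acc = acc + 0.5d0*big(i)
      end do
!$OMP BARRIER      ! re-synchronise before the next time step
   end do
   thread_sum(it) = acc
!$OMP END PARALLEL

   print *, 'sum over all threads =', sum(thread_sum)
end program sync_scan
```

Without the barriers the faster threads run ahead and stream through a different part of the array than the slower ones; with them, the blocks one thread brings into the shared L3 are more likely to still be resident when the other threads need them.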

Some types of calculations are more suited to OpenMP than others.