I can confirm the findings using ifort (IFORT) 19.1.2.254 20200623 and gcc 10.3.0 (Ubuntu 10.3.0-1ubuntu1~20.04) on an Intel(R) Xeon(R) CPU E5-2687W 0:
maws01 ➜ gfortran -Ofast -march=native -fopenmp *.f90
maws01 ➜ ./a.out
Calling parallel marbles with 1 threads.
Loop time = 3.174000 seconds.
Speedup = 1.000000x.
------------------------------------------------------
Calling parallel marbles with 4 threads.
Loop time = 0.818000 seconds.
Speedup = 3.880196x.
------------------------------------------------------
maws01 ➜ ifort -fast -xHost -qopenmp *.f90
ld: /opt/intel/compilers_and_libraries_2020/linux/lib/intel64/libiomp5.a(ompt-general.o): in function `ompt_pre_init':
(.text+0x2281): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
maws01 ➜ ./a.out
Calling parallel marbles with 1 threads.
Loop time = 4.291700 seconds.
Speedup = 1.000000x.
------------------------------------------------------
Calling parallel marbles with 4 threads.
Loop time = 41.886398 seconds.
Speedup = 0.102460x.
------------------------------------------------------
The poor performance is directly related to the line

call parser%evaluate(marble(1:3), marble(4:6))

Replacing this with

marble(1:3) = evaluate(marble(1:3), marble(4:6))

together with the elemental function
elemental function evaluate(a, b) result(c)
  use iso_fortran_env, only: real64
  real(real64), intent(in) :: a, b
  real(real64) :: c
  c = (a*b)**2
end function evaluate
gives a near-optimal speedup with both gfortran and ifort.
I must admit that I did not look into parser%evaluate in detail, but it appears quite complex, with many branches, checks on the allocation status of arrays, etc. Such work should be kept out of a hot loop.
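For illustration, here is a minimal sketch of how the elemental replacement might be driven from the parallel loop. The program name, array shapes, iteration count, and loop structure are all assumptions on my part, since the original benchmark source is not shown here; only the `evaluate` function itself is taken from above.

```fortran
program marbles_sketch
  use iso_fortran_env, only: real64
  use omp_lib, only: omp_get_max_threads
  implicit none
  integer, parameter :: n = 10000000      ! assumed iteration count
  real(real64) :: marble(6)               ! assumed shape, as in marble(1:3), marble(4:6)
  integer :: i

  marble = 0.5_real64
  print '(a,i0,a)', 'Calling parallel marbles with ', &
        omp_get_max_threads(), ' threads.'

  !$omp parallel do firstprivate(marble)
  do i = 1, n
     ! Elemental call: no branching or allocation-status checks
     ! inside the hot loop, so the compiler can vectorize it.
     marble(1:3) = evaluate(marble(1:3), marble(4:6))
  end do
  !$omp end parallel do

contains

  elemental function evaluate(a, b) result(c)
    real(real64), intent(in) :: a, b
    real(real64) :: c
    c = (a*b)**2
  end function evaluate

end program marbles_sketch
```

Because `evaluate` is `elemental` and `pure` by implication, each loop iteration is free of side effects and hidden state, which is what lets both compilers parallelize and vectorize it cleanly.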