Poor OpenMP scaling with ifort but not gfortran

I can confirm the findings with ifort (IFORT) 19.1.2.254 20200623 and gcc version 10.3.0 (Ubuntu 10.3.0-1ubuntu1~20.04) on an Intel(R) Xeon(R) CPU E5-2687W 0:

maws01 ➜  gfortran -Ofast -march=native -fopenmp *.f90 
maws01 ➜  ./a.out
 Calling parallel marbles with            1  threads.
 Loop time = 3.174000 seconds.
 Speedup = 1.000000x.
 ------------------------------------------------------
 Calling parallel marbles with            4  threads.
 Loop time = 0.818000 seconds.
 Speedup = 3.880196x.
 ------------------------------------------------------
maws01 ➜  ifort -fast -xHost -qopenmp *.f90
ld: /opt/intel/compilers_and_libraries_2020/linux/lib/intel64/libiomp5.a(ompt-general.o): in function `ompt_pre_init':
(.text+0x2281): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
maws01 ➜  ./a.out 
 Calling parallel marbles with            1  threads.
 Loop time = 4.291700 seconds.
 Speedup = 1.000000x.
 ------------------------------------------------------
 Calling parallel marbles with            4  threads.
 Loop time = 41.886398 seconds.
 Speedup = 0.102460x.
 ------------------------------------------------------

The poor performance is directly related to this call:

call parser%evaluate(marble(1:3), marble(4:6))

Replacing it with

marble(1:3) = evaluate(marble(1:3) , marble(4:6))

where evaluate is defined as

elemental function evaluate(a,b) result(c)
  use iso_fortran_env
  real(real64), intent(in) :: a,b
  real(real64) :: c

  c = (a*b)**2
end function evaluate

gives a near-optimal speedup with both gfortran and ifort.
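For anyone who wants to reproduce this without the original sources, here is a minimal, self-contained sketch of the replacement inside an OpenMP loop. Only the evaluate function comes from the post above. The array layout marbles(6, n), the problem size n, the fill value and the names marble_kernel and marbles_sketch are my assumptions, so this is not the original benchmark.

```fortran
module marble_kernel
  use iso_fortran_env, only: real64
  implicit none
contains
  ! Simple stand-in for parser%evaluate, as in the replacement above.
  elemental function evaluate(a, b) result(c)
    real(real64), intent(in) :: a, b
    real(real64) :: c
    c = (a*b)**2
  end function evaluate
end module marble_kernel

program marbles_sketch
  use iso_fortran_env, only: real64
  use omp_lib, only: omp_get_wtime, omp_get_max_threads
  use marble_kernel, only: evaluate
  implicit none
  integer, parameter :: n = 1000000        ! assumed problem size
  real(real64), allocatable :: marbles(:,:) ! assumed layout: 6 values per marble
  real(real64) :: t0, t1
  integer :: i

  allocate(marbles(6, n))
  marbles = 0.5_real64

  t0 = omp_get_wtime()
  !$omp parallel do default(none) shared(marbles) private(i) schedule(static)
  do i = 1, n
    ! Each iteration touches only its own column, so no synchronisation is needed.
    marbles(1:3, i) = evaluate(marbles(1:3, i), marbles(4:6, i))
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()

  print '(a,i0,a)', ' Calling parallel marbles with ', omp_get_max_threads(), ' threads.'
  print '(a,f10.6,a)', ' Loop time = ', t1 - t0, ' seconds.'

  ! Sanity check: (0.5*0.5)**2 is exactly 0.0625 in binary floating point.
  if (any(marbles(1:3, :) /= 0.0625_real64)) error stop 'unexpected result'
end program marbles_sketch
```

It should build with the same OpenMP flags as above (-fopenmp for gfortran, -qopenmp for ifort). The thread count can be set with OMP_NUM_THREADS to compare 1 and 4 threads.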

I must admit that I did not look into parser%evaluate in detail, but it seems quite complex, with many branches, checks on the allocation status of arrays, and so on. Work like that should be kept out of a hot loop.
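To illustrate what I mean by keeping such checks out of the hot loop, here is a rough sketch of the general pattern: validate the object once before the parallel region, copy what the loop needs into a local variable, and call a small pure kernel inside the loop. The type toy_parser, its component coeff and the procedures is_ready and kernel are all hypothetical and have nothing to do with the internals of the real parser, which I have not studied.

```fortran
module checked_kernel
  use iso_fortran_env, only: real64
  implicit none
  type :: toy_parser                       ! hypothetical, not the real parser type
    real(real64), allocatable :: coeff(:)
  contains
    procedure :: is_ready
  end type toy_parser
contains
  ! All state checks live here and are meant to run once, outside the loop.
  logical function is_ready(self)
    class(toy_parser), intent(in) :: self
    is_ready = allocated(self%coeff)
    if (is_ready) is_ready = size(self%coeff) >= 1
  end function is_ready

  ! Branch-free kernel that is cheap to call from the hot loop.
  pure function kernel(c, a, b) result(r)
    real(real64), intent(in) :: c, a, b
    real(real64) :: r
    r = c*(a*b)**2
  end function kernel
end module checked_kernel

program hoist_checks
  use iso_fortran_env, only: real64
  use checked_kernel
  implicit none
  integer, parameter :: n = 100000         ! assumed problem size
  type(toy_parser) :: p
  real(real64), allocatable :: x(:), y(:)
  real(real64) :: c
  integer :: i

  allocate(p%coeff(1))
  p%coeff = 2.0_real64
  allocate(x(n), y(n))
  x = 0.5_real64
  y = 0.5_real64

  ! Validate once, then copy the needed value into a plain local scalar.
  if (.not. p%is_ready()) error stop 'parser not initialised'
  c = p%coeff(1)

  !$omp parallel do default(none) shared(x, y, c) private(i)
  do i = 1, n
    x(i) = kernel(c, x(i), y(i))
  end do
  !$omp end parallel do

  ! Sanity check: 2*(0.5*0.5)**2 is exactly 0.125.
  if (any(x /= 0.125_real64)) error stop 'unexpected result'
  print *, 'ok'
end program hoist_checks
```

Whether this pattern would fix the ifort slowdown for the real parser is untested. It only shows the structure I would try first.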