OpenMP and array operations

I have a program (or rather, several versions of a program) that uses array operations, and I would like to examine whether it can be accelerated using OpenMP. Are there any OpenMP directives that would allow that? For instance, updating a two-dimensional array:

u = u + deltt * du

where both u and du are two-dimensional arrays.

I know how to do it using an explicit double do-loop, but that is not the point :slight_smile: .

To answer my own question, after another more successful search: the workshare construct might do it.
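
For the update above, that would look something like this (a minimal, untested sketch):

    !$omp parallel workshare
    u = u + deltt * du
    !$omp end parallel workshare

The workshare construct splits the work of the array assignment over the threads of the parallel region.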

There are apparently some caveats, but part of my experiments is to see the effect of different implementations and compiler options on the performance. So that is all part of the game.

It would be interesting to compare the performance of these methods:

(1) Array syntax (2 dim array) with openmp workshare directive

(2) nested do loops with openmp parallel do collapse(2)

(3) do concurrent
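
For reference, (3) would look something like this (a sketch, assuming the arrays run from 1 to nx and 1 to ny):

    do concurrent (j = 1:ny, i = 1:nx)
       u(i,j) = u(i,j) + deltt * du(i,j)
    end do

Whether do concurrent actually executes in parallel depends on the compiler and the options used.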

Thanks for the suggestions. I had not thought of do concurrent, I must admit. The number of combinations (implementations and compilers and compiler options and problem sizes) is already quite large, but the results so far are, well, intriguing and puzzling.

I have implemented a version of my program using the OpenMP workshare directive. The performance was abominable :slight_smile: : a calculation that originally took 300 seconds now takes 2560 seconds with one thread and 2170 seconds with two threads. This was with gfortran. As the total set of calculations had not finished overnight, I still have to look at the results with ifx.

No experiment yet with do concurrent, as I focused on a rather radically different implementation using OpenMP. Unfortunately, the combination of a parallel region and a do-while loop with a barrier inside that loop is forbidden. Exit my attempt to be smart and keep the parallel region open during most of the calculation.
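
Schematically, the idea was something like this (a rough sketch, not the actual code; converged, maxdiff and eps are placeholders):

    !$omp parallel shared(converged)
    do                                ! iterate until convergence
       ! each thread updates its own part of the grid here
       !$omp barrier
       !$omp single
       converged = maxdiff < eps      ! one thread evaluates the stopping criterion
       !$omp end single               ! implicit barrier, so all threads see the result
       if ( converged ) exit
    end do
    !$omp end parallel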

I have run the ifx version for a small number of cases (I did not have much time, or indeed patience - I prefer to let these calculations run overnight), but I do not see the dramatic deterioration I saw with gfortran. On the other hand, the calculations with two cores are slower than with one core. Not a good sign :slight_smile: .

My approach to using OpenMP for this calculation is as follows:

   subroutine do_omp_u ( u, du, deltt, n, m )
       integer :: n,m
       real(8) :: u(n,m)
       real(8) :: du(n,m)
       real(8) :: deltt

       integer :: j

     !$OMP PARALLEL DO shared (u, du, deltt, n, m) private (i)
       do j = 1,m
!!       call daxpy (n, deltt, du(1,j), u(1,j) )
         u(:,j) = u(:,j) + deltt * du(:,j)
       end do
     !$OMP END PARALLEL DO

   end subroutine do_omp_u

  subroutine daxpy ( n, a, X, Y )
!
!   Performs the vector operation  [Y] = [Y] + a * [X]
!
   integer n 
   real*8  X(n), Y(n), a 
! 
   if ( n == 1 ) then
      Y(1) = Y(1) + a * X(1)

   else if ( n > 0 ) then
      Y = Y + a * X

   end if

  end subroutine daxpy

It basically assumes contiguous memory and that there is enough computation to justify initiating the omp region.

I may be wrong, of course, if you do see a use for the OpenMP workshare directive.
PARALLEL DO could work well for large m and n, unless there is a memory-transfer bottleneck or subroutine do_omp_u is called millions of times. The efficiency comes from processing contiguous memory.
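
For what it is worth, a minimal driver to try this out could look like the following (the sizes are arbitrary; compile with -fopenmp or the equivalent option):

   program test_do_omp_u
      implicit none
      integer, parameter :: n = 2000, m = 2000
      real(8), allocatable :: u(:,:), du(:,:)
      real(8) :: deltt

      allocate ( u(n,m), du(n,m) )
      u     = 1.0d0
      du    = 0.5d0
      deltt = 0.01d0

      call do_omp_u ( u, du, deltt, n, m )

      print *, 'u(1,1) = ', u(1,1)   ! expect 1.005
   end program test_do_omp_u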

Nice example! The private variable should be j, not i

Not quite: the do-loop variable is automatically private. So the clause is actually superfluous.
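
That is, the directive can simply read

     !$OMP PARALLEL DO shared (u, du, deltt, n, m)

since the loop index j is predetermined private.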

Could you share at least parts of your code? What you are attempting to do is not completely clear to me…

I have a small question on your example: why the if condition in the subroutine? I.e. why treat separately the case with one element in the arrays?

Sure, here is the version I talked about, code and a sample input file.

poisson_island_pointer.inp.txt (385 Bytes)

poisson_island_pointer_openmp.f90 (5.8 KB)

I don’t know if it’s the actual source code you are compiling, but it is missing the $ in all the OpenMP directives. Assuming that was intentional (just to run the code sequentially), restoring the proper !$omp was not enough, since you are opening a workshare section without closing the previous one (a compilation error in both gfortran and ifort).

Anyway, I got this compilable code:

        !$omp parallel
        !$omp workshare
        pderiv = diffw * pwest  + &
                 diffe * peast  + &
                 diffn * pnorth + &
                 diffs * psouth   &
                 - (diffw + diffe + diffn + diffs) * pcentre + pforce
        u = u + deltt * du
        !$omp end workshare
        !$omp end parallel

But the execution fails (segmentation violation) with both gfortran and ifort (on Linux). This is probably because each thread tries to allocate a big temporary array in its own stack space. Assigning the result to an allocatable tmp(:,:) array instead of the pderiv(:,:) pointer (and then copying it to pderiv) fixes the problem. The very same problem actually occurs with gfortran even without any OpenMP in the code. So in the end what I’m testing is rather:

real, allocatable :: tmp(:,:)
...
        !$omp parallel
        !$omp workshare
        tmp = diffw * pwest  + &
              diffe * peast  + &
              diffn * pnorth + &
              diffs * psouth   &
              - (diffw + diffe + diffn + diffs) * pcentre + pforce
        pderiv = tmp
        u = u + deltt * du
        !$omp end workshare
        !$omp end parallel

gfortran, no OpenMP: 10.0" (1000 iterations max)
ifort, no OpenMP: 10.3"

gfortran, OpenMP workshare (1 thread): 15.6"
ifort, OpenMP workshare (1 thread): 10.8"

gfortran, OpenMP workshare (4 threads): 15.4"
ifort, OpenMP workshare (4 threads): 9.6"

And finally with a classical !$omp parallel do:

gfortran, OpenMP do loops (4 threads): 5.8"
ifort, OpenMP do loops (4 threads): 4.2"

So yes, workshare doesn’t speed anything up here.
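
For reference, the do-loop variant of the u update is along these lines (a sketch, not necessarily the exact code that was timed):

        !$omp parallel do
        do j = lbound(u,2), ubound(u,2)
           u(:,j) = u(:,j) + deltt * du(:,j)
        end do
        !$omp end parallel do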

For Intel Fortran I use the option -heap-arrays. And I use the workshare directive, not the workshare construct, as the array operations in question are single statements. I was not aware of the problem with gfortran on Linux (I have been using Windows for this stuff).

But the code is the actual code. I forgot to add the timing module, but that is all.
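
The compile line is then something like (the spelling of the options differs between Windows and Linux):

    ifx /Qopenmp /heap-arrays poisson_island_pointer_openmp.f90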

@Arjen
I looked at poisson_island_pointer_openmp.f90.
Most of your arrays are accessed via pointers and do not appear to be contiguous.
I am not familiar with this usage, but it could be a big problem for using AVX instructions in the inner loops.
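
For example (a hypothetical sketch, since I have not checked the actual associations), a pointer assignment like

    pwest => u(0:nx-1, 1:ny)

makes pwest an alias for a section of u, so the compiler cannot assume it refers to a contiguous block and must also allow for aliasing with u itself in the inner loops.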

In the example I posted, I included my subroutine daxpy, as I have used it successfully in production code to get AVX efficiency, together with a directive of the form:

!$OMP  PARALLEL DO DEFAULT (NONE)                      &
!$OMP& PRIVATE (I, J, ...)                        &
!$OMP& SHARED  ( ... )                            &
!$OMP& num_threads (threads_to_use)                    &
!$OMP& SCHEDULE(STATIC)

To be effective in your case, “n” should not be too small.

Gfortran gives good diagnostics with these settings and -fimplicit-none
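
That is, a compile line along the lines of

    gfortran -O2 -fopenmp -fimplicit-none -Wall mycode.f90

(the file name and the exact warning and optimisation options here are just an example).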

This is the case in the code you have uploaded: all OpenMP directives are written as !omp xxx instead of !$omp xxx. These lines are therefore ignored, even when -fopenmp is present in the compilation options…

Now, if I add the missing $ on the OpenMP lines, here is what I get when compiling with ifort:

% ifort -fopenmp arjen0.F90
arjen0.F90(47): error #6404: This name does not have a type, and must have an explicit type.   [LUREP]
    lurep = 20 ! Just one output file
----^
arjen0.F90(118): error #7972: An OpenMP* WORKSHARE directive is not permitted in a WORKSHARE directive that binds to the same PARALLEL directive.
!$omp workshare
------^
arjen0.F90(121): error #7628: This OpenMP* END directive does not match the OpenMP* block directive at the top of the stack.
!$omp end parallel
----------^
arjen0.F90(134): error #7653: There are unterminated OpenMP* directive block[s].
        write( 20, '(*(g11.4))') u(:,iy)
^
compilation aborted for arjen0.F90 (code 1)

I don’t get it… To my understanding, a !$omp workshare directive must have a corresponding !$omp end workshare, and everything in between is affected.

Agreed… “pointer” and “non-contiguity” are two challenges for the optimizer.

Oops, you are absolutely right :unamused_face: . I had completely overlooked that. This is only one version of the program - I have others that use explicit loops or ASSOCIATE instead of pointers.
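
Schematically, the ASSOCIATE variant looks like this (a sketch with hypothetical bounds, not the actual code):

    associate ( pcentre => u(2:nx-1, 2:ny-1), pwest => u(1:nx-2, 2:ny-1) )
       ! the array expressions are written with the associate names;
       ! an associate name is not a pointer, which may give the optimizer more to work with
    end associate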

Okay, I will have to revisit this one.

Hm, I thought I had seen a single !$omp workshare statement, but of course, as the line was not recognised as an OpenMP statement/directive/construct at all, the compiler did not complain. I should not do this sort of thing in the evening while distracted :frowning:
