A general remark regarding performance (serial and parallel): performance is not portable.
One important question is therefore which hardware is being used. The software in use may not be aware of the processor layout, e.g. because it is an AMD CPU. There are also OpenMP options to pin threads to certain cores.
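As a rough illustration of the pinning options, the sketch below sets the standard OpenMP environment variables `OMP_PLACES` and `OMP_PROC_BIND` (available since OpenMP 4.0). The chosen values and the program name `./my_app` are assumptions for illustration only; which setting helps, if any, depends on the machine and the workload.

```shell
# Treat each physical core as one "place" a thread can be bound to.
export OMP_PLACES=cores

# Bind threads to places and keep them close to the master thread.
# "spread" is the alternative that distributes threads across the machine.
export OMP_PROC_BIND=close

# Ask the OpenMP runtime to print its effective settings at program start,
# which helps to check that the binding was actually picked up.
export OMP_DISPLAY_ENV=true

# Then run the OpenMP program as usual (hypothetical name):
# ./my_app
```

Comparing timings with `close`, `spread`, and no binding at all on the actual machine is usually more informative than guessing, since the best choice is hardware dependent.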