Slower performance on cluster than local machine

I have been using preCICE to couple two Lattice Boltzmann solvers and have run into a strange issue. When I run the coupled solvers in parallel on my local machine with 8 cores (4 for each solver), the simulation is about 4 times faster than on a cluster. To make the comparison fair, I ran the cluster job on a single node and used the "lo" network. Even when I increase the number of cores, I see no change in speed. Could you please help me figure out what the reason might be?

Hi @Moe_dae :wave:

How do you start your simulation on the cluster? Pinning could be one reason. Could you attach your job script?
Did you build preCICE in Release mode?

How does the cluster hardware compare to your local machine?

And please always attach your preCICE configuration :slight_smile:

Benjamin

Hello Benjamin,

Thanks for your response.
On the cluster I built preCICE from source with the default options, so I think it should be in Release mode.
Below is the job script I use to run the code on the cluster (Digital Research Alliance of Canada).

#!/bin/bash 
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=64G

rm -f hosts.intel hosts.ompi
for host in $(scontrol show hostname "$SLURM_JOB_NODELIST"); do
  echo "$host" >> hosts.intel
  for j in $(seq 1 "${SLURM_TASKS_PER_NODE%%(*}"); do
    echo "$host" >> hosts.ompi
  done
done

# first 4 ranks for the coarse solver, remaining ranks for the fine solver
head -n 4 hosts.ompi > Coarse/hosts.a
tail -n +5 hosts.ompi > Fine/hosts.b

set -m
(
  cd Coarse/
  
  mpirun -n 4 -hostfile hosts.a ./Coarse &
  
  cd ../Fine/
  mpirun -n 4 -hostfile hosts.b ./Fine &

  wait
)
echo "All participants succeeded"

In terms of the hardware:
Cluster: 2 x Intel E5-2683 v4 (Broadwell) @ 2.1 GHz
Local machine: 11th Gen Intel(R) Core™ i9-11900 @ 2.50 GHz

Below is the preCICE configuration file:

<precice-configuration>
  <solver-interface dimensions="3">
    <data:vector name="Pop_Coarse00" />
    <data:vector name="Pop_Coarse00_eq" />
  
    <data:vector name="Pop_Fine00" />
    <data:vector name="Pop_Fine00_eq" />


    <mesh name="Mesh_Coarse">
      <use-data name="Pop_Coarse00" />
      <use-data name="Pop_Coarse00_eq"
 
      <use-data name="Pop_Fine00" />
      <use-data name="Pop_Fine00_eq" />
      
    </mesh>

    <mesh name="Mesh_Fine">
      <use-data name="Pop_Fine00" />
      <use-data name="Pop_Fine00_eq" />
     
      <use-data name="Pop_Coarse00" />
      <use-data name="Pop_Coarse00_eq" />

    </mesh>

      <participant name="cavity3d">
      <use-mesh name="Mesh_Coarse" provide="yes" /> 

      <write-data name="Pop_Coarse00" mesh="Mesh_Coarse" />
      <write-data name="Pop_Coarse00_eq" mesh="Mesh_Coarse" />
      
      <read-data name="Pop_Fine00" mesh="Mesh_Coarse" />
      <read-data name="Pop_Fine00_eq" mesh="Mesh_Coarse" />
       
    </participant>
    <participant name="Boundary">
      <use-mesh name="Mesh_Fine" provide="yes" />
      <use-mesh name="Mesh_Coarse" from="cavity3d"/> 
       <mapping:nearest-neighbor
        direction="read"
        from="Mesh_Coarse"
        to="Mesh_Fine"
        constraint="consistent" />
      <mapping:nearest-neighbor
        direction="write"
        from="Mesh_Fine"
        to="Mesh_Coarse"
        constraint="consistent" />  
      <write-data name="Pop_Fine00" mesh="Mesh_Fine" />
      <write-data name="Pop_Fine00_eq" mesh="Mesh_Fine" />
   
      <read-data name="Pop_Coarse00" mesh="Mesh_Fine" />
      <read-data name="Pop_Coarse00_eq" mesh="Mesh_Fine" />
    
    </participant>

    <m2n:sockets from="Boundary" to="cavity3d" />

    <coupling-scheme:parallel-explicit>
      <participants first="cavity3d" second="Boundary" />
      <max-time-windows value="20000000" />
      <time-window-size value="0.032" />

      <exchange data="Pop_Fine00" mesh="Mesh_Coarse" from="Boundary" to="cavity3d" />   
      <exchange data="Pop_Fine00_eq" mesh="Mesh_Coarse" from="Boundary" to="cavity3d" />
      
      <exchange data="Pop_Coarse00" mesh="Mesh_Coarse" from="cavity3d" to="Boundary"/>
      <exchange data="Pop_Coarse00_eq" mesh="Mesh_Coarse" from="cavity3d" to="Boundary"/>
    </coupling-scheme:parallel-explicit>
  </solver-interface>
</precice-configuration>

I should mention that in my code, all the data to be written is currently gathered on one rank, and preCICE is called only on that specific rank. This certainly adds some latency. I am now changing the code so that preCICE is called on all ranks, which will remove the data transfer between ranks. Still, the point is that the code is faster on my local machine with the same data-transfer setup.
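For reference, the change I have in mind looks roughly like this. It is only a sketch: I am assuming a C++ solver and the preCICE v2 API, the configuration file name is a placeholder, and the mesh coordinates and the actual LBM part are left out. The participant, mesh, and data names are the ones from my configuration above.

#include <mpi.h>
#include <precice/SolverInterface.hpp>

#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Every rank constructs the interface with its own rank and size, so preCICE
  // partitions the coupling mesh itself instead of one rank gathering all data.
  precice::SolverInterface precice("Boundary", "precice-config.xml", rank, size);

  const int meshID  = precice.getMeshID("Mesh_Fine");
  const int writeID = precice.getDataID("Pop_Fine00", meshID);
  const int readID  = precice.getDataID("Pop_Coarse00", meshID);

  // Only the coupling vertices owned by this rank (placeholder: fill with real coordinates).
  std::vector<double> coords; // x0, y0, z0, x1, y1, z1, ...
  const int n = static_cast<int>(coords.size()) / 3;
  std::vector<int> vertexIDs(n);
  precice.setMeshVertices(meshID, n, coords.data(), vertexIDs.data());

  double dt = precice.initialize();
  std::vector<double> values(3 * n, 0.0); // this rank's share of the vector data

  while (precice.isCouplingOngoing()) {
    // ... local LBM collide/stream on this rank's subdomain ...
    precice.writeBlockVectorData(writeID, n, vertexIDs.data(), values.data());
    dt = precice.advance(dt);
    precice.readBlockVectorData(readID, n, vertexIDs.data(), values.data());
  }

  precice.finalize();
  MPI_Finalize();
  return 0;
}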

Thanks a lot
Moe

I am not sure why you cannot see the whole config file. I have copied the whole file.

No, the default is Debug.

https://precice.org/installation-source-configuration.html#options

Release should be much faster.
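For reference, a Release build from the preCICE source directory looks roughly like this (the install prefix is just an example, adjust paths and the -j value to your setup):

mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$HOME/software/precice ..
make -j 8
make install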

Pinning looks good.
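If you want to double-check the actual placement at runtime and your mpirun is Open MPI's, you can let it print the bindings, e.g.:

mpirun -n 4 -hostfile hosts.a --bind-to core --report-bindings ./Coarse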
