Slower performance on cluster than local machine

Moe_dae · May 28, 2023, 7:21pm

I have been using preCICE to couple two Lattice Boltzmann solvers and have run into a strange issue. When I run the coupled solvers in parallel on my local machine using 8 cores (4 per solver), the run is 4 times faster than on a cluster. To compare the two, I ran the code on a single cluster node and used the “lo” network. Even when I increase the number of cores, the speed does not change. Could you please help me find the reason for this?

uekerman · May 29, 2023, 7:24am

Hi @Moe_dae

How do you start your simulation on the cluster? Pinning could be one reason. Could you attach your job script?
Did you build preCICE in Release mode?

How does the cluster hardware compare to your local machine?

And please always attach your preCICE configuration

Benjamin

Moe_dae · May 29, 2023, 5:34pm

Hello Benjamin,

Thanks for your response.
On the cluster I built preCICE from source with the default options, so I think it should be in Release mode.
Below is the job script I used to run the code on the cluster (Digital Research Alliance of Canada).

#!/bin/bash 
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=64G

rm -f hosts.intel
rm -f hosts.ompi
for host in `scontrol show hostname $SLURM_JOB_NODELIST`; do
  echo $host >> hosts.intel;
  for j in $(seq 1 ${SLURM_TASKS_PER_NODE%%(*}); do
    echo $host >> hosts.ompi;
  done
done

head -n 4 hosts.ompi > Coarse/hosts.a
tail -n +5 hosts.ompi > Fine/hosts.b

set -m
(
  cd Coarse/
  
  mpirun -n 4 -hostfile hosts.a ./Coarse &
  
  cd ../Fine/
  mpirun -n 4 -hostfile hosts.b ./Fine &

  wait
)
echo "All participants succeeded"

In terms of hardware:
Cluster: 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz
Local machine: 11th Gen Intel(R) Core™ i9-11900 @ 2.50GHz

Below is the preCICE config file:

<precice-configuration>
  <solver-interface dimensions="3">
    <data:vector name="Pop_Coarse00" />
    <data:vector name="Pop_Coarse00_eq" />
  
    <data:vector name="Pop_Fine00" />
    <data:vector name="Pop_Fine00_eq" />


    <mesh name="Mesh_Coarse">
      <use-data name="Pop_Coarse00" />
      <use-data name="Pop_Coarse00_eq" />
 
      <use-data name="Pop_Fine00" />
      <use-data name="Pop_Fine00_eq" />
      
    </mesh>

    <mesh name="Mesh_Fine">
      <use-data name="Pop_Fine00" />
      <use-data name="Pop_Fine00_eq" />
     
      <use-data name="Pop_Coarse00" />
      <use-data name="Pop_Coarse00_eq" />

    </mesh>

    <participant name="cavity3d">
      <use-mesh name="Mesh_Coarse" provide="yes" /> 

      <write-data name="Pop_Coarse00" mesh="Mesh_Coarse" />
      <write-data name="Pop_Coarse00_eq" mesh="Mesh_Coarse" />
      
      <read-data name="Pop_Fine00" mesh="Mesh_Coarse" />
      <read-data name="Pop_Fine00_eq" mesh="Mesh_Coarse" />
       
    </participant>
    <participant name="Boundary">
      <use-mesh name="Mesh_Fine" provide="yes" />
      <use-mesh name="Mesh_Coarse" from="cavity3d" />
      <mapping:nearest-neighbor
        direction="read"
        from="Mesh_Coarse"
        to="Mesh_Fine"
        constraint="consistent" />
      <mapping:nearest-neighbor
        direction="write"
        from="Mesh_Fine"
        to="Mesh_Coarse"
        constraint="consistent" />  
      <write-data name="Pop_Fine00" mesh="Mesh_Fine" />
      <write-data name="Pop_Fine00_eq" mesh="Mesh_Fine" />
   
      <read-data name="Pop_Coarse00" mesh="Mesh_Fine" />
      <read-data name="Pop_Coarse00_eq" mesh="Mesh_Fine" />
    
    </participant>

    <m2n:sockets from="Boundary" to="cavity3d" />
    <coupling-scheme:parallel-explicit>
      <participants first="cavity3d" second="Boundary" />
      <max-time-windows value="20000000" />
      <time-window-size value="0.032" />
               
      <exchange data="Pop_Fine00" mesh="Mesh_Coarse" from="Boundary" to="cavity3d" />   
      <exchange data="Pop_Fine00_eq" mesh="Mesh_Coarse" from="Boundary" to="cavity3d" />
      
      <exchange data="Pop_Coarse00" mesh="Mesh_Coarse" from="cavity3d" to="Boundary"/>
      <exchange data="Pop_Coarse00_eq" mesh="Mesh_Coarse" from="cavity3d" to="Boundary"/>
    </coupling-scheme:parallel-explicit>
  </solver-interface>
</precice-configuration>

I should mention that in my code all the data to be written is first gathered on one rank, and preCICE is called only on that rank. This certainly adds some latency. I am currently changing the code so that preCICE is called on all ranks, which removes the data transfer between ranks. Still, the point is that the code runs faster on my local machine with the same data-transfer setup.

Thanks a lot
Moe

Moe_dae · May 29, 2023, 5:41pm

I am not sure why you cannot see the whole config file. I have copied the whole file.

uekerman · May 30, 2023, 6:35am

No, the default is Debug.

https://precice.org/installation-source-configuration.html#options

Release should be much faster.
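
A quick way to see which build type an existing preCICE build used is to look it up in the CMake cache of the build directory. The following is a minimal sketch, assuming the build directory still exists; the helper name `build_type` and the path `~/precice/build` are placeholders for illustration, not part of preCICE.

```shell
# Print the CMAKE_BUILD_TYPE recorded in a CMake build directory.
# $1: path to the build directory (hypothetical example: ~/precice/build).
build_type() {
  # CMakeCache.txt stores entries as NAME:TYPE=VALUE,
  # e.g. CMAKE_BUILD_TYPE:STRING=Debug
  sed -n 's/^CMAKE_BUILD_TYPE:[A-Z]*=//p' "$1/CMakeCache.txt"
}

# Example usage (path is an assumption, adjust to your setup):
#   build_type ~/precice/build
#
# To switch to an optimized build, reconfigure and rebuild from
# inside the build directory:
#   cmake -DCMAKE_BUILD_TYPE=Release ..
#   make -j 4
```

If the helper prints `Debug`, the library was built without optimizations, and a rebuild with `-DCMAKE_BUILD_TYPE=Release` is worth trying before looking at anything else.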

Pinning looks good.
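
To confirm the pinning on the cluster node itself rather than from the job script alone, each rank can report the cores it is allowed to run on. The following is a minimal sketch, assuming a Linux node (it reads the `Cpus_allowed_list` entry of `/proc/<pid>/status`); the helper name `allowed_cpus` is made up for illustration.

```shell
# Print the list of CPU cores a process is allowed to run on (Linux only).
# $1: path to a /proc/<pid>/status file; defaults to the current process.
allowed_cpus() {
  awk '/^Cpus_allowed_list:/ {print $2}' "${1:-/proc/self/status}"
}

# Example usage inside the job script, from the Coarse/ directory,
# so that every rank prints its host and its allowed cores:
#   mpirun -n 4 -hostfile hosts.a sh -c \
#     'echo "$(hostname): $(grep Cpus_allowed_list /proc/self/status)"'
```

If two ranks that are supposed to run side by side report the same single core, they are sharing it. With Open MPI, `mpirun --report-bindings` prints comparable information at startup; whether that applies here depends on the MPI implementation loaded on the cluster.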
