Dear Frédéric,
Thank you for taking the time to try to help me; I desperately need it. I have some more updates, but I feel like I'm running out of options. (To jump directly to my questions instead of reading this whole message, skip to "My remaining problems/questions" below.)
I have run several tests since my last comment here, always using the same Slurm script that I shared in my previous comment, with various CPU allocations. This time I ran the tests on my own case. It is very similar to the heat-exchanger tutorial, but there is only one OpenFOAM and one CalculiX participant; the CCX mesh has 14k cells and the OF mesh has around 1M cells.
Since the solid participant has so few cells and a fairly small interface, I expected the executionTime and clockTime values in the OpenFOAM logs to be close to one another. That was indeed the case when I ran the same simulation on my 6-core PC, but there is a huge difference between these values when I run the case on the HPC. Please see the results in the table below:
| case | Machine | Coupling-Scheme | network | OF-decPar | CCX-OMP | Slrm-OF | Slrm-CCX | wait | ExcTime (s) | ClckTime (s) | ClckT/ExcT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | PC_6c | parallel-Implicit-IQN-ILS | | 5 | 1 | | | | 6310.56 | 6311 | 1.00 |
| 1 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 50 | 50 | 50 | 1 | 242.71 | 1687 | 6.95 |
| 2 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 50 | 64 | 64 | 1 | 256.05 | 1052 | 4.11 |
| 3 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 50 | 64 | 64 | 0 | 255.83 | 1043 | 4.08 |
| 4 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 1 | 64 | 64 | 1 | 222.01 | 1310 | 5.90 |
| 5 | HPC_1n_128c | parallel-Explicit | | 50 | 50 | 64 | 64 | 1 | 204.75 | 1028 | 5.02 |
| 6 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 10 | 64 | 64 | 0 | 247.24 | 970 | 3.92 |
| 7 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 20 | 64 | 64 | 0 | 252.01 | 970 | 3.85 |
| 8 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 100 | 20 | 104 | 24 | 0 | 205.83 | 958 | 4.65 |
| 9 | HPC_1n_128c | parallel-Implicit-IQN-ILS | ib0 | 100 | 20 | 104 | 24 | 1 | 198.89 | 906 | 4.56 |
| 10 | HPC_1n_128c | parallel-Implicit-IQN-ILS | ib0 | 50 | 20 | 64 | 64 | 1 | 387.43 | 951 | 2.45 |
| 10.2 | HPC_1n_128c | parallel-Implicit-IQN-ILS | ib0 | 50 | 20 | 64 | 64 | 1 | 391.82 | 938 | 2.39 |
| 10.3 | HPC_1n_128c | parallel-Implicit-IQN-ILS | ib0 | 50 | 20 | 64 | 64 | 1 | 384.15 | 934 | 2.39 |
(I believe I didn't make any mistakes, but it is entirely possible that some of the values reported in the table above are wrong.)
Here are the explanations of each column:

| Column | Meaning |
| --- | --- |
| network | Value of `network` in precice-config.xml |
| OF-decPar | `numberOfSubdomains` in the OpenFOAM decomposeParDict |
| CCX-OMP | `OMP_NUM_THREADS` for CalculiX |
| Slrm-OF | Number of cores allocated to OpenFOAM using taskset |
| Slrm-CCX | Number of cores allocated to CalculiX using taskset |
| wait | Whether a `wait` command is included in the Slurm script (1 = yes, 0 = no) |
| ExcTime | Execution time of the 1000th step in log.buoyantSimpleFoam (see the grep snippet below) |
| ClckTime | Clock time of the 1000th step in log.buoyantSimpleFoam |
| ClckT/ExcT | Ratio of ClckTime to ExcTime (ideally it should be close to 1) |
(I used the exact same precice-config.xml as in the heat-exchanger tutorial and kept the names of the interface boundaries, participants, and meshes the same in my case, just to be on the safe side.)
From these results I noticed:
- Communication + CCX takes more time relative to OF on the HPC than on my PC.
- Allocating more cores (using taskset) than numberOfSubdomains, for both OF and CCX, reduced the computation time (see the sketch after this list for what I mean by the taskset allocation).
- Since the solid mesh is very small, changing the number of cores for CCX between 10 and 50 didn't change the clock time much; even setting it to 1 didn't slow it down too much. (So the main source of the difference between execution and clock times is either poor communication or poor parallelization of CCX.)
- The best configuration for minimizing clockTime, when network="ib0" was not used, was case 8.
- Setting the network to InfiniBand (network="ib0") reduced clockTime by ~5%.
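To make the taskset columns concrete, this is roughly the pattern I used in the earlier single-node runs (illustrative core ranges and counts taken from case 7; my exact script is in my previous comment):

```bash
# Illustrative only: OpenFOAM pinned to cores 0-63, CalculiX to cores 64-127
cd fluid
taskset -c 0-63 mpirun -np 50 buoyantSimpleFoam -parallel > log.buoyantSimpleFoam 2>&1 &
cd ../solid
export OMP_NUM_THREADS=20
export CCX_NPROC_EQUATION_SOLVER=20
taskset -c 64-127 ccx_preCICE -i solid -precice-participant Solid > log.ccx 2>&1 &
cd ..
wait
```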
Around this time, I read your comment about using host files. I read the content at the link you suggested and tried to apply the instructions, but I couldn't manage to start a simulation on multiple nodes: I either received errors, or the simulation got stuck with the participants waiting for one another.
The main problem, I think, is that ccx_preCICE can't be run using mpirun; instead, it needs two environment variables (OMP_NUM_THREADS and CCX_NPROC_EQUATION_SOLVER) set to the desired number of cores.
After bouncing ideas off ChatGPT, I tried changing the line
`<m2n:sockets connector="Fluid-Inner" acceptor="Solid" exchange-directory=".." />`
to
`<m2n:sockets connector="Fluid-Inner" acceptor="Solid" exchange-directory=".." network="ib0" />`
in precice-config.xml (i.e., adding network="ib0") and ran cases 9 to 10.3 with this option, using InfiniBand.
It improved the clockTime slightly, but OF itself seems to take longer to solve: the executionTime increased from roughly 250 s to 390 s (compare case 7 with cases 10 to 10.3).
My remaining problems/questions:
- What is the correct way to run ccx_preCICE in parallel on the HPC? (My Slurm script is below; it seems mpirun is not the way.) I'm currently using SPOOLES; should I install other solver libraries and use them instead, and if so, how should I proceed? (Please don't say that this question belongs on the CalculiX forums.)
- How should I use hostfiles correctly to run the case on multiple nodes, or on a single node?
- A different way to ask the same questions: what is the best practice for running a coupled case with preCICE, OF, and CCX on an HPC cluster using Slurm scripts, both for a single node and for multiple nodes?
I'm getting frustrated; please give me some guidance. (My Slurm script, where I tried to use hostfiles, is below.)
Sincerely
Umut
My Slurm script using hostfiles:

```bash
#!/bin/bash
#SBATCH -A account
#SBATCH -n 256
#SBATCH -p queue
# ----------------------------------------------------------------------------
# Environment modules
# (Load everything you need for OpenFOAM, preCICE, CalculiX, etc.)
# ----------------------------------------------------------------------------
export PMIX_MCA_psec=none
module load ek-moduller-easybuild
module load arpack-ng/3.9.0-foss-2023a
spack load openmpi@5.0.6 \
netlib-lapack@3.12.1 \
openblas@0.3.29 \
yaml-cpp@0.8.0 \
openfoam@2312 \
precice@3.1.2 \
petsc@3.22.3
export PATH=$HOME/apps/calculix-adapter-master/bin:$PATH
# Make sure the libraries for preCICE, yaml-cpp, etc. are visible at runtime
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/precice-3.1.2-ukxuqmx2goykhc5c4tyw3huhawqxokwo/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/yaml-cpp-0.8.0-fypxvjn4bec57zb4rq3pb4aqp62vlog7/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/netlib-lapack-3.12.1-6wv46i4ij6ilsumqpyw23hmrwpwi7b5q/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/petsc-3.22.3-f6fi4zethcvvsqnq6wv5fwxprhsphju7/lib:$LD_LIBRARY_PATH
# ----------------------------------------------------------------------------
# 1) Clean / prepare the tutorial or case
# ----------------------------------------------------------------------------
caseFolder=$(pwd)
"$caseFolder/clean-tutorial.sh" > clean-tutorial.log
# ----------------------------------------------------------------------------
# 2) Generate hostfile(s) for OpenMPI
# We have exactly 2 allocated nodes, each with 128 cores.
# ----------------------------------------------------------------------------
rm -f hosts.ompi
for host in $(scontrol show hostname "$SLURM_JOB_NODELIST"); do
# For OpenMPI, the syntax is "host slots=N"
echo "${host} slots=128" >> hosts.ompi
done
# Now split the file so that:
# - 'hosts_openfoam' contains the first node (1 line)
# - 'hosts_ccx' contains the second node (1 line)
head -n 1 hosts.ompi > hosts_openfoam
tail -n 1 hosts.ompi > hosts_ccx
# Extract node names from hostfiles
OF_NODE=$(awk '{print $1}' hosts_openfoam)
CCX_NODE=$(awk '{print $1}' hosts_ccx)
# ----------------------------------------------------------------------------
# 3) Decompose fluid domain (OpenFOAM)
# ----------------------------------------------------------------------------
cd fluid
decomposePar -force > log.decomposePar
cd "$caseFolder"
# ----------------------------------------------------------------------------
# 4) Launch solvers in parallel
# - OpenFOAM: 128 MPI processes on node #1
# - CalculiX: 1 process (with 128 OMP threads) on node #2
#
# We use 'set -m' + subshell + background processes + 'wait'
# so both solvers run in parallel and the script waits for both.
# ----------------------------------------------------------------------------
set -m
(
# --- 4a) OpenFOAM (MPI, node #1) ---
cd fluid
mpirun -np 128 -hostfile ../hosts_openfoam buoyantSimpleFoam -parallel > log.buoyantSimpleFoam &
cd ..
echo "OpenFOAM is running on $OF_NODE"
# --- 4b) CalculiX (srun, node #2) ---
echo "Starting CalculiX on $CCX_NODE"
srun --nodes=1 --ntasks=1 --nodelist="$CCX_NODE" \
bash -c "
cd solid
export OMP_NUM_THREADS=128
export CCX_NPROC_EQUATION_SOLVER=128
ccx_preCICE -i solid -precice-participant Solid > log.ccx
" &
echo "Solid run started\n"
wait
)
echo "All participants have completed."