Running CalculiX and OpenFOAM on HPC

Dear preCICE community,

I have made some progress, but I’m still hoping to hear your opinion on the subject. (Using taskset to bind processes to specific cores made a significant improvement.)

I ran some tests on the heat-exchanger tutorial using the SLURM script from my previous comment, with various CPU allocations. All cases ran on a single node with 128 cores. Here are the results:

Notation for the test setups (preCICE, OF, ccx_preCICE):

  • NOC: Number of cores allocated for participants: Participant1-OF, Participant2-OF, Participant3-CCX
  • ET@TS4: ExecutionTime at time step #4 (from Participant1-OF log file)
  • The first two cases are OpenFOAM only, with the preCICE function object commented out in controlDict
  • taskset: I changed the launch lines in the SLURM script from $caseFolder/fluid-inner-openfoam/run.sh -parallel &
    to taskset -c 0-49 ./run.sh -parallel &, hoping that this would bind each participant’s processes to specific cores (shown below).
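
For reference, the change in the SLURM script amounts to this (cores 0-49 for this participant; the other two participants get the ranges 50-99 and 100-127):

# before (no core binding):
$caseFolder/fluid-inner-openfoam/run.sh -parallel &
# after (pin this participant to cores 0-49):
cd fluid-inner-openfoam
taskset -c 0-49 ./run.sh -parallel &
cd ..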

Results:

NOC: 1         - ET@TS4: 21.37 s
NOC: 50        - ET@TS4: 0.66 s
NOC: 1 1 1     - ET@TS4: 22.17 s
NOC: 1 1 28    - ET@TS4: 21.86 s
NOC: 50 50 28  - ET@TS4: 285.1 s
NOC: 50 50 28  - ET@TS4: 1.3 s (with taskset binding processes to specific cores)
  • As expected, using 50 cores reduced the simulation time from 21.37 s to 0.66 s in the fluid-only case.
  • In the coupled cases with OF running in serial, the results are similar to the fluid-only case (~21 s).
  • In the coupled case with OF run with the -parallel option, the simulation duration increases from 21 s to 285 s. (Clearly something is wrong.)
  • When taskset is used to bind processes to specific cores, the execution time drops back to the expected range (1.3 s), though it is still slower than the fluid-only case (0.66 s) with the same number of cores.

I also attempted to run each participant in an individual SLURM script and submitted 3 jobs (hoping that each would get its own node), but the simulation stalled: each participant ended up waiting for the others.

Here are my questions:

  1. Do these results indicate that I’m still not using an optimal SLURM script? (please check the script below)
  2. If the SLURM script is acceptable as is, how can I run coupled simulations on multiple nodes?

Sincerely,
Umut

SLURM script:

#!/bin/bash
#SBATCH -A account
#SBATCH -n 128
#SBATCH -p queue
# ------------------------------
module load ek-moduller-easybuild
module load arpack-ng/3.9.0-foss-2023a
spack load openmpi@5.0.6 netlib-lapack@3.12.1 openblas@0.3.29 yaml-cpp@0.8.0 openfoam@2312 precice@3.1.2 petsc@3.22.3
export PATH=$HOME/apps/calculix-adapter-master/bin:$PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/precice-3.1.2-ukxuqmx2goykhc5c4tyw3huhawqxokwo/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/yaml-cpp-0.8.0-fypxvjn4bec57zb4rq3pb4aqp62vlog7/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/netlib-lapack-3.12.1-6wv46i4ij6ilsumqpyw23hmrwpwi7b5q/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/petsc-3.22.3-f6fi4zethcvvsqnq6wv5fwxprhsphju7/lib:$LD_LIBRARY_PATH
# ------------------------------

caseFolder=$(pwd)
$caseFolder/clean-tutorial.sh > clean-tutorial.log

cd fluid-inner-openfoam
taskset -c 0-49 ./run.sh  -parallel &
cd ..

cd fluid-outer-openfoam
taskset -c 50-99 ./run.sh  -parallel &
cd ..

cd solid-calculix
taskset -c 100-127 ./run.sh &
cd ..

wait

Hi,

Glad to hear that you are making progress!
I’ll be able to answer you in more detail next week, but I strongly suspect that your initial slowdown is due to double subscription of nodes. We generally recommend using hostfiles to handle this. Have a look at this page:

Best
Frédéric

Dear Frédéric,

Thank you for taking the time to help me; I really need it. I have some more updates, but I feel like I’m running out of options. (To skip this whole message and go directly to my questions, see “My remaining problems/questions” below.)

I have run several tests since my last comment here, still using the SLURM script I shared previously, with various CPU allocations. This time I ran the tests on my own case, which is very similar to the heat-exchanger tutorial but has only one OpenFOAM and one CCX participant. The CCX mesh has 14k cells and the OF mesh around 1M cells.

Since the solid participant has so few cells and a relatively small interface, I expected the executionTime and clockTime values in the OpenFOAM logs to be close to one another. This was the case when I ran the same simulation on my 6-core PC. However, there is a huge difference between these values when I run the case on the HPC system. Please see the results in the table below:

| case | Machine | Coupling-Scheme | network | OF-decPar | CCX-OMP | Slrm-OF | Slrm-CCX | wait | ExcTime | ClckTime | ExcT/ClckT |
|------|---------|-----------------|---------|-----------|---------|---------|----------|------|---------|----------|------------|
| 0 | PC_6c | parallel-Implicit-IQN-ILS | | 5 | 1 | | | | 6310.56 | 6311 | 1.00 |
| 1 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 50 | 50 | 50 | 1 | 242.71 | 1687 | 6.95 |
| 2 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 50 | 64 | 64 | 1 | 256.05 | 1052 | 4.11 |
| 3 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 50 | 64 | 64 | 0 | 255.83 | 1043 | 4.08 |
| 4 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 1 | 64 | 64 | 1 | 222.01 | 1310 | 5.90 |
| 5 | HPC_1n_128c | parallel-Explicit | | 50 | 50 | 64 | 64 | 1 | 204.75 | 1028 | 5.02 |
| 6 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 10 | 64 | 64 | 0 | 247.24 | 970 | 3.92 |
| 7 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 20 | 64 | 64 | 0 | 252.01 | 970 | 3.85 |
| 8 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 100 | 20 | 104 | 24 | 0 | 205.83 | 958 | 4.65 |
| 9 | HPC_1n_128c | parallel-Implicit-IQN-ILS | ib0 | 100 | 20 | 104 | 24 | 1 | 198.89 | 906 | 4.56 |
| 10 | HPC_1n_128c | parallel-Implicit-IQN-ILS | ib0 | 50 | 20 | 64 | 64 | 1 | 387.43 | 951 | 2.45 |
| 10.2 | HPC_1n_128c | parallel-Implicit-IQN-ILS | ib0 | 50 | 20 | 64 | 64 | 1 | 391.82 | 938 | 2.39 |
| 10.3 | HPC_1n_128c | parallel-Implicit-IQN-ILS | ib0 | 50 | 20 | 64 | 64 | 1 | 384.15 | 934 | 2.39 |

(I believe I didn’t make any mistakes, but it is entirely possible that some of the values reported in the table above are wrong.)

Here are the explanations about each column:

  • network: the value of the network attribute in precice-config.xml
  • OF-decPar: numberOfSubdomains (OpenFOAM decomposition)
  • CCX-OMP: OMP_NUM_THREADS
  • Slrm-OF: number of cores allocated to OpenFOAM using taskset
  • Slrm-CCX: number of cores allocated to CalculiX using taskset
  • wait: whether a wait command is included in the SLURM script
  • ExcTime: execution time at the 1000th step in log.buoyantSimpleFoam
  • ClckTime: clock time at the 1000th step in log.buoyantSimpleFoam
  • ExcT/ClckT: ratio between the two times (the tabulated values are ClckTime/ExcTime; ideally this should be close to 1)

(I used the exact same precice-config.xml as in the heat-exchanger tutorial and kept the names of the interface boundaries, participants, and meshes identical in my case, just to be on the safe side.)

From these results I noticed:

  • Communication + CCX takes much more time relative to OF on the HPC system than on my PC.
  • Allocating more cores (using taskset) than numberOfSubdomains for both OF and CCX reduced the computation time.
  • Since the solid mesh is very small, changing the number of cores for CCX between 10 and 50 didn’t change the clock time much; even setting it to 1 didn’t slow it down too much. (The main source of the difference between execution and clock times is therefore either poor communication or poor parallelization of CCX.)
  • Among the cases without network="ib0", the best configuration for minimizing clockTime was case 8.
  • Setting the network to InfiniBand (network="ib0") reduced clockTime by ~5%.

Around this time, I read your comment about using hostfiles. I read the page you suggested and tried to apply the instructions, but I couldn’t manage to start a simulation on multiple nodes: I either received errors, or the simulation got stuck with the participants waiting for one another.

The main problem was (I think) that ccx_preCICE can’t be run using mpirun; instead, its parallelism is controlled by two environment variables set to the desired number of cores (see the sketch below).
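
In practice the CCX launch then looks roughly like this (a sketch; the value is just the number of cores I want CCX to use):

export OMP_NUM_THREADS=20               # threads for the multithreaded parts of CCX
export CCX_NPROC_EQUATION_SOLVER=20     # threads for the equation solver (SPOOLES here)
ccx_preCICE -i solid -precice-participant Solid > log.ccx &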

After bouncing ideas off ChatGPT, I tried changing the line
<m2n:sockets connector="Fluid-Inner" acceptor="Solid" exchange-directory=".." />
to
<m2n:sockets connector="Fluid-Inner" acceptor="Solid" exchange-directory=".." network="ib0" />
in precice-config.xml and ran cases 9 to 10.3 with this option (using InfiniBand).

It improved the clockTime slightly, but OF itself seems to take longer to solve: the executionTime increased roughly from 250 s to 390 s (compare case 7 with cases 10*).

My remaining problems/questions:

  1. What is the correct way to run ccx_preCICE in parallel on HPC? (I added my SLURM script below.) It seems that mpirun is not the way. I’m using SPOOLES; should I install other solver libraries and use them instead, and if so, how should I proceed? (Please don’t say that this question belongs to the CCX forums.)
  2. How should I use hostfiles correctly to run the case on multiple nodes or on a single node?
  3. A different way to ask the same questions: what is the best practice for running a coupled preCICE + OF + CCX case on HPC using SLURM scripts, both on one node and on multiple nodes?

I’m getting frustrated, please give me some guidance. (I added my SLURM script below, where I tried to use hostfiles.)

Sincerely

Umut

My slurm script to use hostfiles:

#!/bin/bash
#SBATCH -A account
#SBATCH -n 256
#SBATCH -p queue

# ----------------------------------------------------------------------------
# Environment modules
# (Load everything you need for OpenFOAM, preCICE, CalculiX, etc.)
# ----------------------------------------------------------------------------
export PMIX_MCA_psec=none
module load ek-moduller-easybuild
module load arpack-ng/3.9.0-foss-2023a

spack load openmpi@5.0.6 \
           netlib-lapack@3.12.1 \
           openblas@0.3.29 \
           yaml-cpp@0.8.0 \
           openfoam@2312 \
           precice@3.1.2 \
           petsc@3.22.3

export PATH=$HOME/apps/calculix-adapter-master/bin:$PATH

# Make sure the libraries for preCICE, yaml-cpp, etc. are visible at runtime
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/precice-3.1.2-ukxuqmx2goykhc5c4tyw3huhawqxokwo/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/yaml-cpp-0.8.0-fypxvjn4bec57zb4rq3pb4aqp62vlog7/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/netlib-lapack-3.12.1-6wv46i4ij6ilsumqpyw23hmrwpwi7b5q/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/petsc-3.22.3-f6fi4zethcvvsqnq6wv5fwxprhsphju7/lib:$LD_LIBRARY_PATH

# ----------------------------------------------------------------------------
# 1) Clean / prepare the tutorial or case
# ----------------------------------------------------------------------------
caseFolder=$(pwd)
"$caseFolder/clean-tutorial.sh" > clean-tutorial.log

# ----------------------------------------------------------------------------
# 2) Generate hostfile(s) for OpenMPI
#    We have exactly 2 allocated nodes, each with 128 cores.
# ----------------------------------------------------------------------------
rm -f hosts.ompi
for host in $(scontrol show hostname "$SLURM_JOB_NODELIST"); do
  # For OpenMPI, the syntax is "host slots=N"
  echo "${host} slots=128" >> hosts.ompi
done

# Now split the file so that:
#   - 'hosts_openfoam' contains the first node (1 line)
#   - 'hosts_ccx'      contains the second node (1 line)
head -n 1 hosts.ompi > hosts_openfoam
tail -n 1 hosts.ompi > hosts_ccx
# Extract node names from hostfiles
OF_NODE=$(awk '{print $1}' hosts_openfoam)
CCX_NODE=$(awk '{print $1}' hosts_ccx)

# ----------------------------------------------------------------------------
# 3) Decompose fluid domain (OpenFOAM)
# ----------------------------------------------------------------------------
cd fluid
decomposePar -force > log.decomposePar
cd "$caseFolder"

# ----------------------------------------------------------------------------
# 4) Launch solvers in parallel
#    - OpenFOAM: 128 MPI processes on node #1
#    - CalculiX: 1 process (with 128 OMP threads) on node #2
#
# We use 'set -m' + subshell + background processes + 'wait'
# so both solvers run in parallel and the script waits for both.
# ----------------------------------------------------------------------------
set -m

(
  # --- 4a) OpenFOAM (MPI, node #1) ---
  cd fluid
  mpirun -np 128 -hostfile ../hosts_openfoam buoyantSimpleFoam -parallel > log.buoyantSimpleFoam &
  cd ..
  echo "OpenFOAM is running on $OF_NODE"

  # --- 4b) CalculiX (using srun) ---
  echo "Starting CalculiX on $CCX_NODE"
  srun --nodes=1 --ntasks=1 --nodelist="$CCX_NODE" \
       bash -c "
         cd solid
         export OMP_NUM_THREADS=128
         export CCX_NPROC_EQUATION_SOLVER=128
         ccx_preCICE -i solid -precice-participant Solid > log.ccx
       " &

  echo "Solid run started\n"
  
  wait
)

echo "All participants have completed."

Hi,
I split the topic after the python problem disappeared.

CalculiX on HPC

The CCX adapter itself is not designed to be run in parallel with MPI. That said, it can still be launched on its own node via mpirun -n 1 --hostfile X.

To my understanding, CCX is best used on the fat nodes of your cluster, using OpenMP threads to take advantage of the whole node.
If performance is a problem, you can switch to their PaStiX solver with optional CUDA support.

The CalculiX Discourse is probably your best source of information here. @mattfrei, is there any information to be added?

CCX and Slurm session partitioning

Create a hostfile with the first node for CalculiX and a hostfile with the remaining nodes for OpenFOAM. In the documentation we use something like this:

head -1 hosts.ompi > hosts.ccx
tail +2 hosts.ompi > hosts.of

Then start one CCX with one rank using the hosts.ccx hostfile, and OpenFOAM with the other.
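
Roughly, the launch then looks like this (untested sketch; run each command from the respective case directory and set OMP_NUM_THREADS / CCX_NPROC_EQUATION_SOLVER for CalculiX as usual):

# CalculiX: a single rank on the first node, parallelized via OpenMP threads
mpirun -np 1 -hostfile ../hosts.ccx ccx_preCICE -i solid -precice-participant Solid > log.ccx &
# OpenFOAM: MPI ranks on the remaining node(s)
mpirun -np 128 -hostfile ../hosts.of buoyantSimpleFoam -parallel > log.buoyantSimpleFoam &
wait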

This is pretty much what you are doing right now.

This may be different on heterogeneous clusters, and we have no experience with this so far.

It may be that you need to use job farming, by queuing one job of one node on the partition with the fat nodes, and another job for n nodes on the partition with the normal nodes. These jobs need to be launched together in order not to waste resources.

The LRZ, home of the SuperMUC-NG, has some documentation on this subject.

In any case, it’s probably best to get hands-on time with your system admin to figure this out.

Communication cost in coupled simulations

It is always tricky to make claims about communication cost when coupling simulations using preCICE.

The communication cost in terms of pure transfer is generally not an issue.
The observed communication cost includes various waiting times and is heavily influenced by

  1. the used coupling scheme (especially serial)
  2. the number of ranks per participant
  3. the load balance of your participants, including the runtime of the coupled solvers and the data mapping schemes in preCICE.

This is why we developed profiling tools that give you a visual overview of all ranks and participants at the same time. The visual representation of these wait times is invaluable for localizing the problem at hand.

I recommend trimming your simulation with <max-time-windows ... /> and enabling <profiling mode="all" /> to get the full picture.
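
A rough sketch of the workflow (check precice-profiling --help for the exact arguments; the participant name comes from your precice-config.xml):

# run the trimmed case with <profiling mode="all" /> enabled, then:
precice-profiling merge                  # collect the per-rank profiling JSON files
precice-profiling trace                  # write trace.json for ui.perfetto.dev / chrome://tracing
precice-profiling analyze Fluid-Inner    # timing summary per rank for one participant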

Hope that helps!


I’m extremely delighted and grateful for the guidance I’ve received here, on https://precice.discourse.group/t/does-the-executable-binary-file-of-calculix-support-running-on-a-slurm-cluster-via-mpi/2336/6 and on the CCX forum: https://calculix.discourse.group/t/can-calculix-run-across-multiple-nodes/1316/6. Thank you all for your help. Below, I’ve summarized my situation and outlined my next steps.

Goal

I aim to transition from running my simulation on a 6-core CPU on my PC to an HPC system with 128-core nodes, targeting at least a 10x speed-up (using 20+ times more cores). However, so far, I’ve only achieved a 1x to 4x speed-up.

Case Details

I’m running a steady-state Conjugate Heat Transfer (CHT) case with radiation, involving one fluid and one solid participant. The coupling is handled using parallel-implicit mode with the same preCICE configuration as in the heat-exchanger tutorial.

For radiation modeling, I have two options:

  • fvDOM in OpenFOAM: After 4–5 timesteps, coupling iterations per timestep drop to 1 (almost like explicit coupling).
  • Cavity radiation in CCX: Requires ~10 coupling iterations per timestep but provides more reliable results.

Performance on HPC

  • On my PC (6 cores), the case runs successfully.
  • On HPC (128-core nodes), I expected a 10x speed-up when using 1–2 nodes (20–40x more cores).
  • However, results show:
    • fvDOM in OpenFOAM: ~4x speed-up.
    • Cavity radiation in CCX: <2x speed-up, despite a 20x increase in core count.

From OpenFOAM’s executionTime output, I see that OpenFOAM scales well (tested up to 100 cores). However, the overall simulation time does not decrease significantly, suggesting an issue with CCX or coupling.

My assumptions

  • If CCX is correctly configured (with Spooles, Pardiso, or PaStiX and a proper Slurm script), it should scale reasonably well up to ~100 cores in a single node using OpenMP, rather than just 4–8 cores.
  • If this is true, the issue could be:
    1. A bad Slurm script
    2. The need to switch solvers (from Spooles to Pardiso/PaStiX)

Next Steps

  1. Enable deeper profiling by adding lines to precice-config.xml, as @fsimonis suggested, to track communication and CCX execution time.
  2. Fix the Slurm script: run CCX on one node and OpenFOAM on another, avoiding synchronization issues. (Hopefully I can fix the problem where the simulation gets stuck with participants waiting for each other.)
  3. Install PaStiX (Spack installation available).
  4. Install Pardiso.
  5. Test different CPU allocations and solvers (Spooles, Pardiso, PaStiX) on the HPC and compare performance results.

I have limited experience with HPC installations, Slurm scripts, and hostfiles, and I also have other responsibilities, so progress might be slow. However, I will share my findings here as I move forward.

Meanwhile, if anyone with experience in CCX on HPC has additional insight to share, I would greatly appreciate it.

Kind regards,
Umut

This indeed sounds like OpenFOAM significantly outperforms CCX and spends its time waiting for data from CCX. This should be trivial to spot when checking the profiling information visually.


Can you verify that CalculiX uses the expected number of OpenMP threads using OMP_DISPLAY_ENV? It is possible that OMP_NUM_THREADS is not set in your favour. You may also want to check the effect of OMP_PROC_BIND on CCX.
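
Something along these lines in the CCX launch step should make the effective settings visible in the job output (sketch):

export OMP_DISPLAY_ENV=TRUE      # OpenMP runtime prints its effective settings at startup
export OMP_NUM_THREADS=64        # whatever you intend CCX to use
# export OMP_PROC_BIND=spread    # optionally experiment with thread pinning
ccx_preCICE -i solid -precice-participant Solid > log.ccx &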


Let us know if you run into some strange behaviour when using the CCX adapter!


Something that could be interesting to you if you are feeling adventurous:
We are currently testing a change aiming to reduce lag in the socket implementation. Its motivation is to reduce performance degradation when using substeps="true" in exchange tags, but I am measuring a decent performance uplift for the current default of substeps="false".
We haven’t tested this on HPC yet, so if you find the time to try this change, it could be beneficial to both of us.

The change is small enough to be applied manually.

Dear all,

I’ve made the following progress so far:

  1. Loaded necessary modules, updated the ccx_precice Makefile, and recompiled to use PARDISO along with SPOOLES. (Installation was easier than I anticipated.)
  2. Considered installing PaStiX and MUMPS but prioritized other tasks.
  3. Conducted performance tests comparing PARDISO and SPOOLES.
  4. Added export OMP_DISPLAY_ENV=TRUE in the SLURM script to obtain more details on the ccx_precice execution.
  5. Enabled detailed profiling by adding <profiling mode="all" /> in the XML file, merged the JSON files, created trace.json, and uploaded it to ui.perfetto.dev to visualize the profiling results.
  6. Attempted to run the case using hostfiles but failed.

Switching to PARDISO reduced the clockTime by ~25%. However, it did not resolve the extreme difference between execution time and clock time reported in the OpenFOAM (OF) log file. This suggests that despite CCX handling fewer cells (~14k) compared to OF (~1M), CCX + communication takes 4 to 10 times longer than OF.

Current Issue: Running CCX & OpenFOAM on Separate Nodes

I suspect that correctly configuring my SLURM script to use hostfiles—running CCX on one node and OF on another—might resolve the issue. However, all my attempts have failed so far. Here are the methods I tested:

  1. Using taskset on a single node (run.sh):
export OMP_NUM_THREADS=20
export CCX_NPROC_EQUATION_SOLVER=20
taskset -c 0-23 ccx_preCICE -i solid -precice-participant Solid > log.ccx &
  2. Using hostfiles on 2 nodes (hostrun.sh):
set -m 
(
  cd solid
  export OMP_NUM_THREADS=128
  export CCX_NPROC_EQUATION_SOLVER=128
  echo "Starting CalculiX on node: $(awk '{print $1}' ../hosts_ccx)"
  mpirun -np 1 \
         -hostfile ../hosts_ccx \
         ccx_preCICE -i solid -precice-participant Solid \
         > log.ccx 2>&1
) &
  3. Using srun on 2 nodes (srun.sh):
set -m 
(
  cd solid
  echo "Launching CalculiX with 1 MPI rank + 128 threads..."
  export OMP_NUM_THREADS=128
  export CCX_NPROC_EQUATION_SOLVER=128
  srun --nodes=1 --ntasks=1 \
       ccx_preCICE -i solid -precice-participant Solid \
       > log.ccx 2>&1
) &

Both the hostfile-based (mpirun) and srun approaches have failed. The log files and the thrown error look like this:

#--------------------------------------- Slurm Output ---------------------------------------#
hosts_ccx:
a048 slots=128
hosts_of:
a085 slots=128
Starting OpenFOAM on nodes: a085
Starting CalculiX on node: a048
#--------------------------------------- OF Output ---------------------------------------#
Host key verification failed.
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-a048-107109@0,0] on node a048
  Remote daemon: [prterun-a048-107109@0,1] on node a085

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

What I Need Help With:

  1. SLURM Script Example
  • A proper SLURM script designed to run on an HPC system
  • Participants: ccx_precice and OpenFOAM
  • Uses hostfiles to ensure CCX and OF run on separate nodes
  2. Guidance
  • How should I modify my SLURM script to use hostfiles correctly?
  • Any insights into resolving the host key verification issue?

Additionally, while I’ve added OMP_DISPLAY_ENV and generated visual profiling results, I’m unsure how to properly evaluate these.

I’m attaching:

  • My SLURM script (hostrun.sh)
  • trace.json files (ran for 10 steps) and SLURM outputs for SPOOLES and PARDISO cases

At this point, I’m unsure what to try next and would greatly appreciate your input.

Looking forward to your suggestions!

Cheers,
Umut

Note @fsimonis: I am interested in trying what you suggested about substeps="false", but I prioritized other steps and haven’t found the time yet.

Uploaded trace.json files and slurm outputs for SPOOLES and PARDISO cases:
Drive link for results and logs

Slurm Script (hostrun.sh):

#!/bin/bash
#SBATCH -A project
#SBATCH -n 256 # total cores = 2 nodes × 128 cores each
#SBATCH -p queue
#SBATCH -t 02:00:00

# ------------------------------------------------------------------------------
# 0) Environment modules. Load everything needed for your HPC environment, OpenFOAM, preCICE, CCX.
# ------------------------------------------------------------------------------
module purge
module load intel/oneAPI-compiler-2022.2.0
module load gcc/12.3.0
source ~/apps/spack/share/spack/setup-env.sh
module load ek-moduller-easybuild
module load arpack-ng/3.9.0-foss-2023a
spack load openmpi@5.0.6 netlib-lapack@3.12.1 openblas@0.3.29 \
           yaml-cpp@0.8.0 openfoam@2312 precice@3.1.2 petsc@3.22.3

export PMIX_MCA_psec=none
export PATH=$HOME/apps/calculix-adapter-master/bin:$PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/precice-3.1.2-ukxuqmx2goykhc5c4tyw3huhawqxokwo/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/yaml-cpp-0.8.0-fypxvjn4bec57zb4rq3pb4aqp62vlog7/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/netlib-lapack-3.12.1-6wv46i4ij6ilsumqpyw23hmrwpwi7b5q/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/petsc-3.22.3-f6fi4zethcvvsqnq6wv5fwxprhsphju7/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/pastix-6.4.0-girmii3lhfr3zmevykyuc7wh4tn3diu7/lib:$LD_LIBRARY_PATH

echo "Case folder: $(pwd)"

# ------------------------------------------------------------------------------
# 1) Clean/prepare the case
# ------------------------------------------------------------------------------
caseFolder=$(pwd)
if [ -x "$caseFolder/clean-tutorial.sh" ]; then
  "$caseFolder/clean-tutorial.sh" > clean-tutorial.log
fi

# ------------------------------------------------------------------------------
# 2) Generate OpenMPI hostfiles
#    Slurm gives us two nodes for 256 total cores (128 cores/node).
# ------------------------------------------------------------------------------
rm -f hosts.ompi

# 2a) List the allocated hosts and create a single "hosts.ompi" file.
for host in $(scontrol show hostname "$SLURM_JOB_NODELIST"); do
  echo "${host} slots=128" >> hosts.ompi
done

# 2b) Now split:
#   - 'hosts_ccx' contains the FIRST node (1 line)
#   - 'hosts_of'  contains the REMAINING nodes
head -n 1 hosts.ompi > hosts_ccx
tail -n +2 hosts.ompi > hosts_of

# Let's see what we got:
echo "hosts_ccx:"
cat hosts_ccx
echo "hosts_of:"
cat hosts_of

# ------------------------------------------------------------------------------
# 3) Decompose the OpenFOAM case if you haven't done it beforehand.
# ------------------------------------------------------------------------------
cd fluid
decomposePar -force > log.decomposePar
cd ..

# ------------------------------------------------------------------------------
# 4) Launch Solvers in Parallel
#    - CCX: 1 MPI rank on the first node, but 128 OpenMP threads
#    - OpenFOAM: 128 MPI ranks on the other node(s)
# ------------------------------------------------------------------------------
set -m  # enable job control (so we can background processes & wait for them)

# 4a) Start CalculiX on the FIRST node (head -n 1 => hosts_ccx)
#     We use mpirun -np 1 --hostfile to place 1 MPI process on that node.
(
  cd solid
  export OMP_NUM_THREADS=128
  export CCX_NPROC_EQUATION_SOLVER=128
  echo "Starting CalculiX on node: $(awk '{print $1}' ../hosts_ccx)"
  
  mpirun -np 1 \
         -hostfile ../hosts_ccx \
         ccx_preCICE -i solid -precice-participant Solid \
         > log.ccx 2>&1
) &

# 4b) Start OpenFOAM on the OTHER node(s) (tail -n +2 => hosts_of)
(
  cd fluid
  echo "Starting OpenFOAM on nodes: $(awk '{print $1}' ../hosts_of)"
  
  mpirun -np 128 \
         -hostfile ../hosts_of \
         buoyantSimpleFoam -parallel \
         > log.buoyantSimpleFoam 2>&1
) &

# 4c) Wait for BOTH background processes to complete
wait

echo "All participants have completed."

Hello @fsimonis,

I’ve recently completed a new case study on running a coupled simulation using ccx_precice and OpenFOAM. To eliminate human error, I automated every step by using scripts—from case creation and job submission to post‑processing (including log parsing and result collection). I ensured that there is no manual editing; every case is generated and processed automatically (and I even ran the whole set twice for consistency).

For this study, I used the cavity radiation model in CalculiX (which gave more reliable results despite requiring more iterations per time step) and ran all cases with PARDISO. In total, I ran nine cases, each for 50 time steps. (All tests were conducted on the same hardware with 128 cores/node and the same queue.)

In the base case (test.03), I allocated 64 cores for both OpenFOAM and CalculiX on a single node (using taskset to assign cores) with network set to ib0, and both solvers were launched in the background (with a final wait command). I did not set any value for OMP_PROC_BIND in that case. I then compared the executionTime and clockTime reported in the OpenFOAM log at the 50th time step. Below is a summary of my results:

| Test | OF #procs | OF #cores | CCX #procs | CCX #cores | OMP_PROC_BIND | Other | ExcTime (1st Run) | ClckTime (1st Run) | ExcTime (2nd Run) | ClckTime (2nd Run) |
|------|-----------|-----------|------------|------------|---------------|-------|-------------------|-------------------|-------------------|-------------------|
| 1 | 64 | 64 | 64 | 64 | | no ib0 | | | 281.03 | 299 |
| 2 | 64 | 64 | 64 | 64 | | no wait | 18.09 | 302 | 14.63 | 300 |
| 3 | 64 | 64 | 64 | 64 | | | | | 21.68 | 295 |
| 4 | 100 | 100 | 28 | 28 | | | 32.65 | 279 | 17.68 | 286 |
| 5 | 100 | 104 | 20 | 24 | | | 258.54 | 273 | | |
| 6 | 100 | 100 | 20 | 20 | | | 271.56 | 294 | 259.81 | 274 |
| 7 | 64 | 64 | 64 | 64 | spread | | 14.73 | 2092 | 9.75 | 2073 |
| 8 | 64 | 64 | 64 | 64 | close | | 2059.52 | 2076 | 9.78 | 2059 |
| 9 | 64 | 64 | 64 | 64 | master | | 2054.95 | 2071 | | |

(Empty OMP_PROC_BIND/Other cells mean the base-case settings; empty time cells correspond to the runs that failed to start, see observation 1 below.)

My observations:

  1. Some cases fail to start randomly (e.g., tests 1 and 3 in the first run; tests 5 and 9 in the second run). (See error message below.)
  2. When I set OMP_PROC_BIND (tests 7, 8, 9), the simulations run about 7.5 times slower, regardless of whether I use “spread,” “close,” or “master.”
  3. Execution times vary substantially even when clockTime remains nearly constant. For instance, in test 8, executionTime dropped from around 2000 to 10 seconds without a corresponding change in clockTime.
  4. Changing core allocations (tests 3, 4, 5, 6) produced little change in clockTime (ranging from 273 to 295 s).

Error note in slurm output:

--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[51280,1],69]
  Exit code:    255
--------------------------------------------------------------------------

I suspect this may be related to running both CCX and OpenFOAM on a single node with taskset. I wonder if running CCX and OpenFOAM on separate nodes might improve stability and clarify the issue. I’ve attached the full slurm output and my slurm script at the end of this post.

My questions:

  1. Have you (or anyone you know) run ccx_precice and OpenFOAM on separate nodes? If so, could you please share a tested complete slurm script?
  2. What is your interpretation of these results? What might be causing the random startup failures and the performance discrepancies?

I appreciate all your help so far and hope these additional details are useful. Thank you very much.

Sincerely,
Umut

Slurm script of test.07:

#!/bin/bash
#SBATCH -A project
#SBATCH -n 128
#SBATCH -p queue


# --- 1) LOAD ENVIRONMENT --- #
module purge
module load intel/oneAPI-compiler-2022.2.0
module load gcc/12.3.0
source ~/apps/spack/share/spack/setup-env.sh
module load ek-moduller-easybuild
module load arpack-ng/3.9.0-foss-2023a
spack load openmpi@5.0.6 netlib-lapack@3.12.1 openblas@0.3.29 \
           yaml-cpp@0.8.0 openfoam@2312 precice@3.1.2 petsc@3.22.3

export PMIX_MCA_psec=none
export PATH=$HOME/apps/calculix-adapter-master/bin:$PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/precice-3.1.2-ukxuqmx2goykhc5c4tyw3huhawqxokwo/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/yaml-cpp-0.8.0-fypxvjn4bec57zb4rq3pb4aqp62vlog7/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/netlib-lapack-3.12.1-6wv46i4ij6ilsumqpyw23hmrwpwi7b5q/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/petsc-3.22.3-f6fi4zethcvvsqnq6wv5fwxprhsphju7/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/pastix-6.4.0-girmii3lhfr3zmevykyuc7wh4tn3diu7/lib:$LD_LIBRARY_PATH

# --- 2) CLEAN CASE & Decompose --- #
caseFolder=$(pwd)
# "$caseFolder/clean-tutorial.sh" > clean-tutorial.log
cd $caseFolder/fluid
decomposePar -force > log.decomposePar
cd $caseFolder


# --- 3) RUN CALCULIX --- #
cd solid
echo "Before Run:"
echo "OMP_PROC_BIND is $OMP_PROC_BIND"
echo "OMP_NUM_THREADS is $OMP_NUM_THREADS"
echo "OMP_DISPLAY_ENV is $OMP_DISPLAY_ENV"

export OMP_DISPLAY_ENV=TRUE
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=64
export CCX_NPROC_EQUATION_SOLVER=64

taskset -c 0-63 ccx_preCICE -i solid -precice-participant Solid > log.ccx &

echo "After Run:"
echo "OMP_NUM_THREADS is $OMP_NUM_THREADS"
echo "OMP_PROC_BIND is $OMP_PROC_BIND"
echo "OMP_DISPLAY_ENV is $OMP_DISPLAY_ENV"
cd ..

# --- 4) RUN OpenFOAM --- #
cd fluid
taskset -c 64-127 mpirun -np 64 buoyantSimpleFoam -parallel > log.buoyantSimpleFoam &
cd ..

# --- 5) WAIT FOR BOTH PROCESSES --- #
wait

Complete slurm output (Test.05 2nd run):

Before Run:
OMP_PROC_BIND is
OMP_NUM_THREADS is
OMP_PROC_BIND is
OMP_DISPLAY_ENV is
After Run:
OMP_NUM_THREADS is 20
OMP_PROC_BIND is
OMP_DISPLAY_ENV is TRUE

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '20'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'FALSE'
  OMP_PLACES = ''
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '1'
  OMP_NUM_TEAMS = '0'
  OMP_TEAMS_THREAD_LIMIT = '0'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  OMP_DISPLAY_AFFINITY = 'FALSE'
  OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
  OMP_ALLOCATOR = 'omp_default_mem_alloc'
  OMP_TARGET_OFFLOAD = 'DEFAULT'
OPENMP DISPLAY ENVIRONMENT END

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201611'
  [host] OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}'
  [host] OMP_ALLOCATOR='omp_default_mem_alloc'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEBUG='disabled'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_AFFINITY='FALSE'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='1'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED: deprecated; max-active-levels-var=1
  [host] OMP_NUM_TEAMS='0'
  [host] OMP_NUM_THREADS='20'
  [host] OMP_PLACES: value is not defined
  [host] OMP_PROC_BIND='false'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='4M'
  [host] OMP_TARGET_OFFLOAD=DEFAULT
  [host] OMP_TEAMS_THREAD_LIMIT='0'
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_TOOL='enabled'
  [host] OMP_TOOL_LIBRARIES: value is not defined
  [host] OMP_TOOL_VERBOSE_INIT: value is not defined
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END


--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[51280,1],69]
  Exit code:    255
--------------------------------------------------------------------------
slurmstepd: error: *** JOB 286932 ON a038 CANCELLED AT 2025-03-05T16:04:00 ***

===== SLURM IS ISTATISTIKLERI (JOB STATISTICS) ========================
Job ID: 
Cluster: 
User/Group: 
State: CANCELLED (exit code 0)
Nodes: 1
Cores per node: 128
CPU Utilized: 00:28:28
CPU Efficiency: 0.83% of 2-09:18:56 core-walltime
Job Wall-clock time: 00:26:52
Memory Utilized: 9.24 GB
Memory Efficiency: 3.79% of 244.14 GB

=======================================================================

There is a Slurm feature called heterogeneous jobs, which seems to be the modern and correct way of simultaneously starting job components with different requirements.

I will be looking into this, as it may simplify our documentation on SLURM sessions. It also opens the door to simpler scheduling on GPU and fat nodes, for example.
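
An untested sketch of what a heterogeneous job could look like for your two-participant setup (the hetjob separator and --het-group flags are standard Slurm, but the details may differ on your system):

#!/bin/bash
#SBATCH -A project
#SBATCH -p queue
#SBATCH -N 1 --ntasks=1 --cpus-per-task=128   # component 0: CalculiX, 1 task with many threads
#SBATCH hetjob
#SBATCH -N 1 --ntasks=128                     # component 1: OpenFOAM, 128 MPI ranks

# load modules / set the environment as in your existing scripts ...

(
  cd solid
  export OMP_NUM_THREADS=128
  export CCX_NPROC_EQUATION_SOLVER=128
  srun --het-group=0 ccx_preCICE -i solid -precice-participant Solid > log.ccx 2>&1
) &

(
  cd fluid
  srun --het-group=1 buoyantSimpleFoam -parallel > log.buoyantSimpleFoam 2>&1
) &

wait

If your OpenMPI is not built with Slurm/PMIx support, launching OpenFOAM with mpirun and a hostfile restricted to the second component should also work.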

The "Host key verification failed" error strongly suggests a problem with the setup of the cluster. I highly recommend reporting this to your system admins. You could try to remove ~/.ssh/known_hosts, but I doubt this will be sufficient in this case.

The most important thing is to check whether CalculiX uses the correct number of threads. Your provided SPOOLES run, for example, uses 20 threads, as indicated by the following line in the output:

 [host] OMP_NUM_THREADS='20'

It is generally interesting that the first OpenFOAM time window is slow and the subsequent time windows are fast.
Meaning, there are 3 phases to your simulation:

  1. participant initialization/startup time, which should stay pretty much the same for this setup
  2. first time window, which should get worse with OF problem size
  3. later time windows, which should remain stable as CalculiX is the bottleneck

I’ll have a look at 3., which is the most important in long-running simulations.

Spooles case

Here you can clearly see that the simulation is bottlenecked by CalculiX.
The OpenFOAM solver spends the majority of its time waiting for data from CalculiX in advance/m2n.receiveData.

Pardiso case

This one looks much better. CalculiX is significantly faster here, but still the bottleneck.
Ideally, for your parallel coupling scheme, the solver.advance of both solvers should take the same time. Here CalculiX runs at around 500 ms and OpenFOAM at around 200 ms.

This is an interesting observation. I’ll keep this in mind.

This looks like some massive hangup though. Could be the file system. Not 100% sure if this helps, but you could try calling sync between your benchmarks.

Your script looks pretty much exactly like what we would expect. I may get around to testing this on the SuperMUC-NG. I’ll let you know if I can share some insights.

My interpretation is

  1. the case is still bottlenecked by CalculiX
  2. there may be some hang times due to filesystem latency, which can be challenging to debug (experience talking)
  3. inspecting the overall time includes start-up, initialization, and other wait times. These are generally negligible in real simulations that run for thousands of time steps; for a 10-time-step test they are not.
  4. inspecting the visual output of the trace files, or the ratio between solver.advance and advance per participant using precice-profiling analyze, may be a more effective way of assessing the efficiency of your simulation.

As CalculiX is your main bottleneck, it may be worth trying a few things (untested):

  • trying the GPU solver
  • upgrading your compilers and linker as far as possible. You are already missing out on 3 years of compiler optimizations, especially for newer architectures. Maybe even try Clang or newer Intel compilers.
  • trying to activate link-time optimization for the CalculiX sources.