Running CalculiX and OpenFOAM on HPC

Dear preCICE community,

I have made some progress, but I’m still hoping to hear your opinion on the subject. (Using taskset to bind processes to specific cores made a significant improvement.)

I ran some tests on the heat-exchanger tutorial using the SLURM script from my previous comment, with various CPU allocations. All cases ran on a single node with 128 cores. Here are the results:

Notation for the test setups (preCICE, OF, ccx_preCICE):

  • NOC: Number of cores allocated for participants: Participant1-OF, Participant2-OF, Participant3-CCX
  • ET@TS4: ExecutionTime at time step #4 (from Participant1-OF log file)
  • The first two cases are OpenFOAM only, with the preCICE function object commented out in controlDict
  • taskset: I changed the launch lines in the SLURM script from $caseFolder/fluid-inner-openfoam/run.sh -parallel &
    to taskset -c 0-49 ./run.sh -parallel &, hoping that this would bind each participant’s processes to specific cores (shown below).
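
For reference, the change in the SLURM script amounts to this (cores 0-49 for this participant; the other two participants get the ranges 50-99 and 100-127):

# before (no core binding):
$caseFolder/fluid-inner-openfoam/run.sh -parallel &
# after (pin this participant to cores 0-49):
cd fluid-inner-openfoam
taskset -c 0-49 ./run.sh -parallel &
cd ..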

Results:

NOC: 1         - ET@TS4: 21.37 s
NOC: 50        - ET@TS4: 0.66 s
NOC: 1 1 1     - ET@TS4: 22.17 s
NOC: 1 1 28    - ET@TS4: 21.86 s
NOC: 50 50 28  - ET@TS4: 285.1 s
NOC: 50 50 28  - ET@TS4: 1.3 s (with taskset binding processes to specific cores)
  • As expected, using 50 cores reduced the simulation time from 21.37 s to 0.66 s in the fluid-only case.
  • In the coupled cases with OF running in serial, the results are similar to the fluid-only case (~21 s).
  • In the coupled case with OF run with the -parallel option, the simulation duration increases from 21 s to 285 s. (Clearly something is wrong.)
  • When taskset is used to bind processes to specific cores, the execution time drops back to the expected range (1.3 s), though it is still slower than the fluid-only case (0.66 s) with the same number of cores.

I also attempted to run each participant in an individual SLURM script and submitted 3 jobs (hoping that each would get its own node), but the simulation stalled: each participant ended up waiting for the others.

Here are my questions:

  1. Do these results indicate that I’m still not using an optimal SLURM script? (please check the script below)
  2. If the SLURM script is acceptable as is, how can I run coupled simulations on multiple nodes?

Sincerely,
Umut

SLURM script:

#!/bin/bash
#SBATCH -A account
#SBATCH -n 128
#SBATCH -p queue
# ------------------------------
module load ek-moduller-easybuild
module load arpack-ng/3.9.0-foss-2023a
spack load openmpi@5.0.6 netlib-lapack@3.12.1 openblas@0.3.29 yaml-cpp@0.8.0 openfoam@2312 precice@3.1.2 petsc@3.22.3
export PATH=$HOME/apps/calculix-adapter-master/bin:$PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/precice-3.1.2-ukxuqmx2goykhc5c4tyw3huhawqxokwo/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/yaml-cpp-0.8.0-fypxvjn4bec57zb4rq3pb4aqp62vlog7/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/netlib-lapack-3.12.1-6wv46i4ij6ilsumqpyw23hmrwpwi7b5q/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/petsc-3.22.3-f6fi4zethcvvsqnq6wv5fwxprhsphju7/lib:$LD_LIBRARY_PATH
# ------------------------------

caseFolder=$(pwd)
$caseFolder/clean-tutorial.sh > clean-tutorial.log

cd fluid-inner-openfoam
taskset -c 0-49 ./run.sh  -parallel &
cd ..

cd fluid-outer-openfoam
taskset -c 50-99 ./run.sh  -parallel &
cd ..

cd solid-calculix
taskset -c 100-127 ./run.sh &
cd ..

wait

Hi,

Glad to hear that you are making progress!
I’ll be able to answer you in more detail next week, but I strongly suspect that your initial slowdown is due to double subscription of nodes. We generally recommend using hostfiles to handle this. Have a look at this page:

Best
Frédéric

Dear Frédéric,

Thank you for taking the time to help me; I really need it. I have some more updates, but I feel like I’m running out of options. (To skip this whole message and go directly to my questions, see “My remaining problems/questions” below.)

I have run several tests since my last comment here, still using the SLURM script I shared previously, with various CPU allocations. This time I ran the tests on my own case, which is very similar to the heat-exchanger tutorial but has only one OpenFOAM and one CCX participant. The CCX mesh has 14k cells and the OF mesh around 1M cells.

Since the solid participant has so few cells and a relatively small interface, I expected the executionTime and clockTime values in the OpenFOAM logs to be close to one another. This was the case when I ran the same simulation on my 6-core PC. However, there is a huge difference between these values when I run the case on the HPC system. Please see the results in the table below:

| case | Machine | Coupling-Scheme | network | OF-decPar | CCX-OMP | Slrm-OF | Slrm-CCX | wait | ExcTime | ClckTime | ExcT/ClckT |
|------|---------|-----------------|---------|-----------|---------|---------|----------|------|---------|----------|------------|
| 0 | PC_6c | parallel-Implicit-IQN-ILS | | 5 | 1 | | | | 6310.56 | 6311 | 1.00 |
| 1 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 50 | 50 | 50 | 1 | 242.71 | 1687 | 6.95 |
| 2 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 50 | 64 | 64 | 1 | 256.05 | 1052 | 4.11 |
| 3 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 50 | 64 | 64 | 0 | 255.83 | 1043 | 4.08 |
| 4 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 1 | 64 | 64 | 1 | 222.01 | 1310 | 5.90 |
| 5 | HPC_1n_128c | parallel-Explicit | | 50 | 50 | 64 | 64 | 1 | 204.75 | 1028 | 5.02 |
| 6 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 10 | 64 | 64 | 0 | 247.24 | 970 | 3.92 |
| 7 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 50 | 20 | 64 | 64 | 0 | 252.01 | 970 | 3.85 |
| 8 | HPC_1n_128c | parallel-Implicit-IQN-ILS | | 100 | 20 | 104 | 24 | 0 | 205.83 | 958 | 4.65 |
| 9 | HPC_1n_128c | parallel-Implicit-IQN-ILS | ib0 | 100 | 20 | 104 | 24 | 1 | 198.89 | 906 | 4.56 |
| 10 | HPC_1n_128c | parallel-Implicit-IQN-ILS | ib0 | 50 | 20 | 64 | 64 | 1 | 387.43 | 951 | 2.45 |
| 10.2 | HPC_1n_128c | parallel-Implicit-IQN-ILS | ib0 | 50 | 20 | 64 | 64 | 1 | 391.82 | 938 | 2.39 |
| 10.3 | HPC_1n_128c | parallel-Implicit-IQN-ILS | ib0 | 50 | 20 | 64 | 64 | 1 | 384.15 | 934 | 2.39 |

(I believe I didn’t make any mistakes, but it is entirely possible that some of the values reported in the table above are wrong.)

Here are the explanations about each column:

  • network: the value of the network attribute in precice-config.xml
  • OF-decPar: numberOfSubdomains (OpenFOAM decomposition)
  • CCX-OMP: OMP_NUM_THREADS
  • Slrm-OF: number of cores allocated to OpenFOAM using taskset
  • Slrm-CCX: number of cores allocated to CalculiX using taskset
  • wait: whether a wait command is included in the SLURM script
  • ExcTime: execution time at the 1000th step in log.buoyantSimpleFoam
  • ClckTime: clock time at the 1000th step in log.buoyantSimpleFoam
  • ExcT/ClckT: ratio between the two times (the tabulated values are ClckTime/ExcTime; ideally this should be close to 1)

(I used the exact same precice-config.xml as in the heat-exchanger tutorial and kept the names of the interface boundaries, participants, and meshes identical in my case, just to be on the safe side.)

From these results I noticed:

  • Communication + CCX takes much more time relative to OF on the HPC system than on my PC.
  • Allocating more cores (using taskset) than numberOfSubdomains for both OF and CCX reduced the computation time.
  • Since the solid mesh is very small, changing the number of cores for CCX between 10 and 50 didn’t change the clock time much; even setting it to 1 didn’t slow it down too much. (The main source of the difference between execution and clock times is therefore either poor communication or poor parallelization of CCX.)
  • Among the cases without network="ib0", the best configuration for minimizing clockTime was case 8.
  • Setting the network to InfiniBand (network="ib0") reduced clockTime by ~5%.

Around this time, I read your comment about using hostfiles. I read the page you suggested and tried to apply the instructions, but I couldn’t manage to start a simulation on multiple nodes: I either received errors, or the simulation got stuck with the participants waiting for one another.

The main problem was (I think) that ccx_preCICE can’t be run using mpirun; instead, its parallelism is controlled by two environment variables set to the desired number of cores (see the sketch below).
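
In practice the CCX launch then looks roughly like this (a sketch; the value is just the number of cores I want CCX to use):

export OMP_NUM_THREADS=20               # threads for the multithreaded parts of CCX
export CCX_NPROC_EQUATION_SOLVER=20     # threads for the equation solver (SPOOLES here)
ccx_preCICE -i solid -precice-participant Solid > log.ccx &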

After bouncing ideas off ChatGPT, I tried changing the line
<m2n:sockets connector="Fluid-Inner" acceptor="Solid" exchange-directory=".." />
to
<m2n:sockets connector="Fluid-Inner" acceptor="Solid" exchange-directory=".." network="ib0" />
in precice-config.xml and ran cases 9 to 10.3 with this option (using InfiniBand).

It improved the clockTime slightly, but OF itself seems to take longer to solve: the executionTime increased roughly from 250 s to 390 s (compare case 7 with cases 10*).

My remaining problems/questions:

  1. What is the correct way to run ccx_preCICE in parallel on HPC? (I added my SLURM script below.) It seems that mpirun is not the way. I’m using SPOOLES; should I install other solver libraries and use them instead, and if so, how should I proceed? (Please don’t say that this question belongs to the CCX forums.)
  2. How should I use hostfiles correctly to run the case on multiple nodes or on a single node?
  3. A different way to ask the same questions: what is the best practice for running a coupled preCICE + OF + CCX case on HPC using SLURM scripts, both on one node and on multiple nodes?

I’m getting frustrated, please give me some guidance. (I added my SLURM script below, where I tried to use hostfiles.)

Sincerely

Umut

My slurm script to use hostfiles:

#!/bin/bash
#SBATCH -A account
#SBATCH -n 256
#SBATCH -p queue

# ----------------------------------------------------------------------------
# Environment modules
# (Load everything you need for OpenFOAM, preCICE, CalculiX, etc.)
# ----------------------------------------------------------------------------
export PMIX_MCA_psec=none
module load ek-moduller-easybuild
module load arpack-ng/3.9.0-foss-2023a

spack load openmpi@5.0.6 \
           netlib-lapack@3.12.1 \
           openblas@0.3.29 \
           yaml-cpp@0.8.0 \
           openfoam@2312 \
           precice@3.1.2 \
           petsc@3.22.3

export PATH=$HOME/apps/calculix-adapter-master/bin:$PATH

# Make sure the libraries for preCICE, yaml-cpp, etc. are visible at runtime
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/precice-3.1.2-ukxuqmx2goykhc5c4tyw3huhawqxokwo/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/yaml-cpp-0.8.0-fypxvjn4bec57zb4rq3pb4aqp62vlog7/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/netlib-lapack-3.12.1-6wv46i4ij6ilsumqpyw23hmrwpwi7b5q/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/petsc-3.22.3-f6fi4zethcvvsqnq6wv5fwxprhsphju7/lib:$LD_LIBRARY_PATH

# ----------------------------------------------------------------------------
# 1) Clean / prepare the tutorial or case
# ----------------------------------------------------------------------------
caseFolder=$(pwd)
"$caseFolder/clean-tutorial.sh" > clean-tutorial.log

# ----------------------------------------------------------------------------
# 2) Generate hostfile(s) for OpenMPI
#    We have exactly 2 allocated nodes, each with 128 cores.
# ----------------------------------------------------------------------------
rm -f hosts.ompi
for host in $(scontrol show hostname "$SLURM_JOB_NODELIST"); do
  # For OpenMPI, the syntax is "host slots=N"
  echo "${host} slots=128" >> hosts.ompi
done

# Now split the file so that:
#   - 'hosts_openfoam' contains the first node (1 line)
#   - 'hosts_ccx'      contains the second node (1 line)
head -n 1 hosts.ompi > hosts_openfoam
tail -n 1 hosts.ompi > hosts_ccx
# Extract node names from hostfiles
OF_NODE=$(awk '{print $1}' hosts_openfoam)
CCX_NODE=$(awk '{print $1}' hosts_ccx)

# ----------------------------------------------------------------------------
# 3) Decompose fluid domain (OpenFOAM)
# ----------------------------------------------------------------------------
cd fluid
decomposePar -force > log.decomposePar
cd "$caseFolder"

# ----------------------------------------------------------------------------
# 4) Launch solvers in parallel
#    - OpenFOAM: 128 MPI processes on node #1
#    - CalculiX: 1 process (with 128 OMP threads) on node #2
#
# We use 'set -m' + subshell + background processes + 'wait'
# so both solvers run in parallel and the script waits for both.
# ----------------------------------------------------------------------------
set -m

(
  # --- 4a) OpenFOAM (MPI, node #1) ---
  cd fluid
  mpirun -np 128 -hostfile ../hosts_openfoam buoyantSimpleFoam -parallel > log.buoyantSimpleFoam &
  cd ..
  echo "OpenFOAM is running on $OF_NODE"

  # --- 4b) CalculiX (using srun) ---
  echo "Starting CalculiX on $CCX_NODE"
  srun --nodes=1 --ntasks=1 --nodelist="$CCX_NODE" \
       bash -c "
         cd solid
         export OMP_NUM_THREADS=128
         export CCX_NPROC_EQUATION_SOLVER=128
         ccx_preCICE -i solid -precice-participant Solid > log.ccx
       " &

  echo "Solid run started\n"
  
  wait
)

echo "All participants have completed."

Hi,
I split the topic after the python problem disappeared.

CalculiX on HPC

The CCX adapter itself is not designed to be run in parallel with MPI. That said, it can still be launched on its own node via mpirun -n 1 --hostfile X.

To my understanding, CCX is best used on the fat nodes of your cluster, using OpenMP threads to take advantage of the whole node.
If performance is a problem, you can switch to their PaStiX solver with optional CUDA support.

The CalculiX Discourse is probably your best source of information here. @mattfrei, is there any information to be added?

CCX and Slurm session partitioning

Create a hostfile with the first node for CalculiX and a hostfile with the remaining nodes for OpenFOAM. In the documentation we use something like this:

head -1 hosts.ompi > hosts.ccx
tail +2 hosts.ompi > hosts.of

Then start one CCX with one rank using the hosts.ccx hostfile, and OpenFOAM with the other.
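
Roughly, the launch then looks like this (untested sketch; run each command from the respective case directory and set OMP_NUM_THREADS / CCX_NPROC_EQUATION_SOLVER for CalculiX as usual):

# CalculiX: a single rank on the first node, parallelized via OpenMP threads
mpirun -np 1 -hostfile ../hosts.ccx ccx_preCICE -i solid -precice-participant Solid > log.ccx &
# OpenFOAM: MPI ranks on the remaining node(s)
mpirun -np 128 -hostfile ../hosts.of buoyantSimpleFoam -parallel > log.buoyantSimpleFoam &
wait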

This is pretty much what you are doing right now.

This may be different on heterogeneous clusters, and we have no experience with this so far.

It may be that you need to use job farming, by queuing one job of one node on the partition with the fat nodes, and another job for n nodes on the partition with the normal nodes. These jobs need to be launched together in order not to waste resources.

The LRZ, home of the SuperMUC-NG, has some documentation on this subject.

In any case, it’s probably best to get hands-on time with your system admin to figure this out.

Communication cost in coupled simulations

It is always tricky to make claims about communication cost when coupling simulations using preCICE.

The communication cost in terms of pure transfer is generally not an issue.
The observed communication cost includes various waiting times and is heavily influenced by

  1. the used coupling scheme (especially serial)
  2. the number of ranks per participant
  3. the load balance of your participants, including the runtime of the coupled solvers and the data mapping schemes in preCICE.

This is why we developed profiling tools that give you a visual overview of all ranks and participants at the same time. The visual representation of these wait times is invaluable for localizing the problem at hand.

I recommend trimming your simulation with <max-time-windows ... /> and enabling <profiling mode="all" /> to get the full picture.
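
A rough sketch of the workflow (check precice-profiling --help for the exact arguments; the participant name comes from your precice-config.xml):

# run the trimmed case with <profiling mode="all" /> enabled, then:
precice-profiling merge                  # collect the per-rank profiling JSON files
precice-profiling trace                  # write trace.json for ui.perfetto.dev / chrome://tracing
precice-profiling analyze Fluid-Inner    # timing summary per rank for one participant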

Hope that helps!


I’m extremely delighted and grateful for the guidance I’ve received here, on https://precice.discourse.group/t/does-the-executable-binary-file-of-calculix-support-running-on-a-slurm-cluster-via-mpi/2336/6 and on the CCX forum: https://calculix.discourse.group/t/can-calculix-run-across-multiple-nodes/1316/6. Thank you all for your help. Below, I’ve summarized my situation and outlined my next steps.

Goal

I aim to transition from running my simulation on a 6-core CPU on my PC to an HPC system with 128-core nodes, targeting at least a 10x speed-up (using 20+ times more cores). However, so far, I’ve only achieved a 1x to 4x speed-up.

Case Details

I’m running a steady-state Conjugate Heat Transfer (CHT) case with radiation, involving one fluid and one solid participant. The coupling is handled using parallel-implicit mode with the same preCICE configuration as in the heat-exchanger tutorial.

For radiation modeling, I have two options:

  • fvDOM in OpenFOAM: After 4–5 timesteps, coupling iterations per timestep drop to 1 (almost like explicit coupling).
  • Cavity radiation in CCX: Requires ~10 coupling iterations per timestep but provides more reliable results.

Performance on HPC

  • On my PC (6 cores), the case runs successfully.
  • On HPC (128-core nodes), I expected a 10x speed-up when using 1–2 nodes (20–40x more cores).
  • However, results show:
    • fvDOM in OpenFOAM: ~4x speed-up.
    • Cavity radiation in CCX: <2x speed-up, despite a 20x increase in core count.

From OpenFOAM’s executionTime output, I see that OpenFOAM scales well (tested up to 100 cores). However, the overall simulation time does not decrease significantly, suggesting an issue with CCX or coupling.

My assumptions

  • If CCX is correctly configured (with Spooles, Pardiso, or PaStiX and a proper Slurm script), it should scale reasonably well up to ~100 cores in a single node using OpenMP, rather than just 4–8 cores.
  • If this is true, the issue could be:
    1. A bad Slurm script
    2. The need to switch solvers (from Spooles to Pardiso/PaStiX)

Next Steps

  1. Enable deeper profiling by adding lines to precice-config.xml, as @fsimonis suggested, to track communication and CCX execution time.
  2. Fix the Slurm script: run CCX on one node and OpenFOAM on another, avoiding synchronization issues. (Hopefully I can fix the problem where the simulation gets stuck with participants waiting for each other.)
  3. Install PaStiX (Spack installation available).
  4. Install Pardiso.
  5. Test different CPU allocations and solvers (Spooles, Pardiso, PaStiX) on the HPC and compare performance results.

I have limited experience with HPC installations, Slurm scripts, and hostfiles, and I also have other responsibilities, so progress might be slow. However, I will share my findings here as I move forward.

Meanwhile, if anyone with experience in CCX on HPC has additional insight to share, I would greatly appreciate it.

Kind regards,
Umut

This indeed sounds like OpenFOAM significantly outperforms CCX and spends its time waiting for data from CCX. This should be trivial to spot when checking the profiling information visually.


Can you verify that CalculiX uses the expected number of OpenMP threads using OMP_DISPLAY_ENV? It is possible that OMP_NUM_THREADS is not set in your favour. You may also want to check the effect of OMP_PROC_BIND on CCX.
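
Something along these lines in the CCX launch step should make the effective settings visible in the job output (sketch):

export OMP_DISPLAY_ENV=TRUE      # OpenMP runtime prints its effective settings at startup
export OMP_NUM_THREADS=64        # whatever you intend CCX to use
# export OMP_PROC_BIND=spread    # optionally experiment with thread pinning
ccx_preCICE -i solid -precice-participant Solid > log.ccx &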


Let us know if you run into some strange behaviour when using the CCX adapter!


Something that could be interesting to you if you are feeling adventurous:
We are currently testing a change aiming to reduce lag in the socket implementation. Its motivation is to reduce performance degradation when using substeps="true" in exchange tags, but I am measuring a decent performance uplift for the current default of substeps="false".
We haven’t tested this on HPC yet, so if you find the time to try this change, it could be beneficial to both of us.

The change is small enough to be applied manually.

Dear all,

I’ve made the following progress so far:

  1. Loaded necessary modules, updated the ccx_precice Makefile, and recompiled to use PARDISO along with SPOOLES. (Installation was easier than I anticipated.)
  2. Considered installing PaStiX and MUMPS but prioritized other tasks.
  3. Conducted performance tests comparing PARDISO and SPOOLES.
  4. Added export OMP_DISPLAY_ENV=TRUE in the SLURM script to obtain more details on the ccx_precice execution.
  5. Enabled detailed profiling by adding <profiling mode="all" /> in the XML file, merged the JSON files, created trace.json, and uploaded it to ui.perfetto.dev to visualize the profiling results.
  6. Attempted to run the case using hostfiles but failed.

Switching to PARDISO reduced the clockTime by ~25%. However, it did not resolve the extreme difference between execution time and clock time reported in the OpenFOAM (OF) log file. This suggests that despite CCX handling fewer cells (~14k) compared to OF (~1M), CCX + communication takes 4 to 10 times longer than OF.

Current Issue: Running CCX & OpenFOAM on Separate Nodes

I suspect that correctly configuring my SLURM script to use hostfiles—running CCX on one node and OF on another—might resolve the issue. However, all my attempts have failed so far. Here are the methods I tested:

  1. Using taskset on a single node (run.sh):
export OMP_NUM_THREADS=20
export CCX_NPROC_EQUATION_SOLVER=20
taskset -c 0-23 ccx_preCICE -i solid -precice-participant Solid > log.ccx &
  2. Using hostfiles on 2 nodes (hostrun.sh):
set -m 
(
  cd solid
  export OMP_NUM_THREADS=128
  export CCX_NPROC_EQUATION_SOLVER=128
  echo "Starting CalculiX on node: $(awk '{print $1}' ../hosts_ccx)"
  mpirun -np 1 \
         -hostfile ../hosts_ccx \
         ccx_preCICE -i solid -precice-participant Solid \
         > log.ccx 2>&1
) &
  3. Using srun on 2 nodes (srun.sh):
set -m 
(
  cd solid
  echo "Launching CalculiX with 1 MPI rank + 128 threads..."
  export OMP_NUM_THREADS=128
  export CCX_NPROC_EQUATION_SOLVER=128
  srun --nodes=1 --ntasks=1 \
       ccx_preCICE -i solid -precice-participant Solid \
       > log.ccx 2>&1
) &

Both the hostfile-based (mpirun) and srun approaches have failed. The log files and the thrown error look like this:

#--------------------------------------- Slurm Output ---------------------------------------#
hosts_ccx:
a048 slots=128
hosts_of:
a085 slots=128
Starting OpenFOAM on nodes: a085
Starting CalculiX on node: a048
#--------------------------------------- OF Output ---------------------------------------#
Host key verification failed.
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-a048-107109@0,0] on node a048
  Remote daemon: [prterun-a048-107109@0,1] on node a085

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

What I Need Help With:

  1. SLURM Script Example
  • A proper SLURM script designed to run on an HPC system
  • Participants: ccx_precice and OpenFOAM
  • Uses hostfiles to ensure CCX and OF run on separate nodes
  2. Guidance
  • How should I modify my SLURM script to use hostfiles correctly?
  • Any insights into resolving the host key verification issue?

Additionally, while I’ve added OMP_DISPLAY_ENV and generated visual profiling results, I’m unsure how to properly evaluate these.

I’m attaching:

  • My SLURM script (hostrun.sh)
  • trace.json files (ran for 10 steps) and SLURM outputs for SPOOLES and PARDISO cases

At this point, I’m unsure what to try next and would greatly appreciate your input.

Looking forward to your suggestions!

Cheers,
Umut

Note @fsimonis: I am interested in trying what you suggested about substeps="false", but I prioritized other steps and haven’t found the time yet.

Uploaded trace.json files and slurm outputs for SPOOLES and PARDISO cases:
Drive link for results and logs

Slurm Script (hostrun.sh):

#!/bin/bash
#SBATCH -A project
#SBATCH -n 256 # total cores = 2 nodes × 128 cores each
#SBATCH -p queue
#SBATCH -t 02:00:00

# ------------------------------------------------------------------------------
# 0) Environment modules. Load everything needed for your HPC environment, OpenFOAM, preCICE, CCX.
# ------------------------------------------------------------------------------
module purge
module load intel/oneAPI-compiler-2022.2.0
module load gcc/12.3.0
source ~/apps/spack/share/spack/setup-env.sh
module load ek-moduller-easybuild
module load arpack-ng/3.9.0-foss-2023a
spack load openmpi@5.0.6 netlib-lapack@3.12.1 openblas@0.3.29 \
           yaml-cpp@0.8.0 openfoam@2312 precice@3.1.2 petsc@3.22.3

export PMIX_MCA_psec=none
export PATH=$HOME/apps/calculix-adapter-master/bin:$PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/precice-3.1.2-ukxuqmx2goykhc5c4tyw3huhawqxokwo/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/yaml-cpp-0.8.0-fypxvjn4bec57zb4rq3pb4aqp62vlog7/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/netlib-lapack-3.12.1-6wv46i4ij6ilsumqpyw23hmrwpwi7b5q/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/petsc-3.22.3-f6fi4zethcvvsqnq6wv5fwxprhsphju7/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/pastix-6.4.0-girmii3lhfr3zmevykyuc7wh4tn3diu7/lib:$LD_LIBRARY_PATH

echo "Case folder: $(pwd)"

# ------------------------------------------------------------------------------
# 1) Clean/prepare the case
# ------------------------------------------------------------------------------
caseFolder=$(pwd)
if [ -x "$caseFolder/clean-tutorial.sh" ]; then
  "$caseFolder/clean-tutorial.sh" > clean-tutorial.log
fi

# ------------------------------------------------------------------------------
# 2) Generate OpenMPI hostfiles
#    Slurm gives us two nodes for 256 total cores (128 cores/node).
# ------------------------------------------------------------------------------
rm -f hosts.ompi

# 2a) List the allocated hosts and create a single "hosts.ompi" file.
for host in $(scontrol show hostname "$SLURM_JOB_NODELIST"); do
  echo "${host} slots=128" >> hosts.ompi
done

# 2b) Now split:
#   - 'hosts_ccx' contains the FIRST node (1 line)
#   - 'hosts_of'  contains the REMAINING nodes
head -n 1 hosts.ompi > hosts_ccx
tail -n +2 hosts.ompi > hosts_of

# Let's see what we got:
echo "hosts_ccx:"
cat hosts_ccx
echo "hosts_of:"
cat hosts_of

# ------------------------------------------------------------------------------
# 3) Decompose the OpenFOAM case if you haven't done it beforehand.
# ------------------------------------------------------------------------------
cd fluid
decomposePar -force > log.decomposePar
cd ..

# ------------------------------------------------------------------------------
# 4) Launch Solvers in Parallel
#    - CCX: 1 MPI rank on the first node, but 128 OpenMP threads
#    - OpenFOAM: 128 MPI ranks on the other node(s)
# ------------------------------------------------------------------------------
set -m  # enable job control (so we can background processes & wait for them)

# 4a) Start CalculiX on the FIRST node (head -n 1 => hosts_ccx)
#     We use mpirun -np 1 --hostfile to place 1 MPI process on that node.
(
  cd solid
  export OMP_NUM_THREADS=128
  export CCX_NPROC_EQUATION_SOLVER=128
  echo "Starting CalculiX on node: $(awk '{print $1}' ../hosts_ccx)"
  
  mpirun -np 1 \
         -hostfile ../hosts_ccx \
         ccx_preCICE -i solid -precice-participant Solid \
         > log.ccx 2>&1
) &

# 4b) Start OpenFOAM on the OTHER node(s) (tail -n +2 => hosts_of)
(
  cd fluid
  echo "Starting OpenFOAM on nodes: $(awk '{print $1}' ../hosts_of)"
  
  mpirun -np 128 \
         -hostfile ../hosts_of \
         buoyantSimpleFoam -parallel \
         > log.buoyantSimpleFoam 2>&1
) &

# 4c) Wait for BOTH background processes to complete
wait

echo "All participants have completed."

Hello @fsimonis,

I’ve recently completed a new case study on running a coupled simulation using ccx_precice and OpenFOAM. To eliminate human error, I automated every step by using scripts—from case creation and job submission to post‑processing (including log parsing and result collection). I ensured that there is no manual editing; every case is generated and processed automatically (and I even ran the whole set twice for consistency).

For this study, I used the cavity radiation model in CalculiX (which gave more reliable results despite requiring more iterations per time step) and ran all cases with PARDISO. In total, I ran nine cases, each for 50 time steps. (All tests were conducted on the same hardware with 128 cores/node and the same queue.)

In the base case (test.03), I allocated 64 cores for both OpenFOAM and CalculiX on a single node (using taskset to assign cores) with network set to ib0, and both solvers were launched in the background (with a final wait command). I did not set any value for OMP_PROC_BIND in that case. I then compared the executionTime and clockTime reported in the OpenFOAM log at the 50th time step. Below is a summary of my results:

| Test | OF #procs | OF #cores | CCX #procs | CCX #cores | OMP_PROC_BIND | Other | ExcTime (1st Run) | ClckTime (1st Run) | ExcTime (2nd Run) | ClckTime (2nd Run) |
|------|-----------|-----------|------------|------------|---------------|-------|-------------------|-------------------|-------------------|-------------------|
| 1 | 64 | 64 | 64 | 64 | | no ib0 | | | 281.03 | 299 |
| 2 | 64 | 64 | 64 | 64 | | no wait | 18.09 | 302 | 14.63 | 300 |
| 3 | 64 | 64 | 64 | 64 | | | | | 21.68 | 295 |
| 4 | 100 | 100 | 28 | 28 | | | 32.65 | 279 | 17.68 | 286 |
| 5 | 100 | 104 | 20 | 24 | | | 258.54 | 273 | | |
| 6 | 100 | 100 | 20 | 20 | | | 271.56 | 294 | 259.81 | 274 |
| 7 | 64 | 64 | 64 | 64 | spread | | 14.73 | 2092 | 9.75 | 2073 |
| 8 | 64 | 64 | 64 | 64 | close | | 2059.52 | 2076 | 9.78 | 2059 |
| 9 | 64 | 64 | 64 | 64 | master | | 2054.95 | 2071 | | |

(Empty OMP_PROC_BIND/Other cells mean the base-case settings; empty time cells correspond to the runs that failed to start, see observation 1 below.)

My observations:

  1. Some cases fail to start randomly (e.g., tests 1 and 3 in the first run; tests 5 and 9 in the second run). (See error message below.)
  2. When I set OMP_PROC_BIND (tests 7, 8, 9), the simulations run about 7.5 times slower, regardless of whether I use “spread,” “close,” or “master.”
  3. Execution times vary substantially even when clockTime remains nearly constant. For instance, in test 8, executionTime dropped from around 2000 to 10 seconds without a corresponding change in clockTime.
  4. Changing core allocations (tests 3, 4, 5, 6) produced little change in clockTime (ranging from 273 to 295 s).

Error note in slurm output:

--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[51280,1],69]
  Exit code:    255
--------------------------------------------------------------------------

I suspect this may be related to running both CCX and OpenFOAM on a single node with taskset. I wonder if running CCX and OpenFOAM on separate nodes might improve stability and clarify the issue. I’ve attached the full slurm output and my slurm script at the end of this post.

My questions:

  1. Have you (or anyone you know) run ccx_precice and OpenFOAM on separate nodes? If so, could you please share a tested complete slurm script?
  2. What is your interpretation of these results? What might be causing the random startup failures and the performance discrepancies?

I appreciate all your help so far and hope these additional details are useful. Thank you very much.

Sincerely,
Umut

Slurm script of test.07:

#!/bin/bash
#SBATCH -A project
#SBATCH -n 128
#SBATCH -p queue


# --- 1) LOAD ENVIRONMENT --- #
module purge
module load intel/oneAPI-compiler-2022.2.0
module load gcc/12.3.0
source ~/apps/spack/share/spack/setup-env.sh
module load ek-moduller-easybuild
module load arpack-ng/3.9.0-foss-2023a
spack load openmpi@5.0.6 netlib-lapack@3.12.1 openblas@0.3.29 \
           yaml-cpp@0.8.0 openfoam@2312 precice@3.1.2 petsc@3.22.3

export PMIX_MCA_psec=none
export PATH=$HOME/apps/calculix-adapter-master/bin:$PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/precice-3.1.2-ukxuqmx2goykhc5c4tyw3huhawqxokwo/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/yaml-cpp-0.8.0-fypxvjn4bec57zb4rq3pb4aqp62vlog7/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/netlib-lapack-3.12.1-6wv46i4ij6ilsumqpyw23hmrwpwi7b5q/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/petsc-3.22.3-f6fi4zethcvvsqnq6wv5fwxprhsphju7/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/apps/spack/opt/spack/linux-rhel8-zen2/gcc-12.3.0/pastix-6.4.0-girmii3lhfr3zmevykyuc7wh4tn3diu7/lib:$LD_LIBRARY_PATH

# --- 2) CLEAN CASE & Decompose --- #
caseFolder=$(pwd)
# "$caseFolder/clean-tutorial.sh" > clean-tutorial.log
cd $caseFolder/fluid
decomposePar -force > log.decomposePar
cd $caseFolder


# --- 3) RUN CALCULIX --- #
cd solid
echo "Before Run:"
echo "OMP_PROC_BIND is $OMP_PROC_BIND"
echo "OMP_NUM_THREADS is $OMP_NUM_THREADS"
echo "OMP_DISPLAY_ENV is $OMP_DISPLAY_ENV"

export OMP_DISPLAY_ENV=TRUE
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=64
export CCX_NPROC_EQUATION_SOLVER=64

taskset -c 0-63 ccx_preCICE -i solid -precice-participant Solid > log.ccx &

echo "After Run:"
echo "OMP_NUM_THREADS is $OMP_NUM_THREADS"
echo "OMP_PROC_BIND is $OMP_PROC_BIND"
echo "OMP_DISPLAY_ENV is $OMP_DISPLAY_ENV"
cd ..

# --- 4) RUN OpenFOAM --- #
cd fluid
taskset -c 64-127 mpirun -np 64 buoyantSimpleFoam -parallel > log.buoyantSimpleFoam &
cd ..

# --- 5) WAIT FOR BOTH PROCESSES --- #
wait

Complete slurm output (Test.05 2nd run):

Before Run:
OMP_PROC_BIND is
OMP_NUM_THREADS is
OMP_PROC_BIND is
OMP_DISPLAY_ENV is
After Run:
OMP_NUM_THREADS is 20
OMP_PROC_BIND is
OMP_DISPLAY_ENV is TRUE

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '20'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'FALSE'
  OMP_PLACES = ''
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '1'
  OMP_NUM_TEAMS = '0'
  OMP_TEAMS_THREAD_LIMIT = '0'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  OMP_DISPLAY_AFFINITY = 'FALSE'
  OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
  OMP_ALLOCATOR = 'omp_default_mem_alloc'
  OMP_TARGET_OFFLOAD = 'DEFAULT'
OPENMP DISPLAY ENVIRONMENT END

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201611'
  [host] OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}'
  [host] OMP_ALLOCATOR='omp_default_mem_alloc'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEBUG='disabled'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_AFFINITY='FALSE'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='1'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED: deprecated; max-active-levels-var=1
  [host] OMP_NUM_TEAMS='0'
  [host] OMP_NUM_THREADS='20'
  [host] OMP_PLACES: value is not defined
  [host] OMP_PROC_BIND='false'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='4M'
  [host] OMP_TARGET_OFFLOAD=DEFAULT
  [host] OMP_TEAMS_THREAD_LIMIT='0'
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_TOOL='enabled'
  [host] OMP_TOOL_LIBRARIES: value is not defined
  [host] OMP_TOOL_VERBOSE_INIT: value is not defined
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END


--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[51280,1],69]
  Exit code:    255
--------------------------------------------------------------------------
slurmstepd: error: *** JOB 286932 ON a038 CANCELLED AT 2025-03-05T16:04:00 ***

===== SLURM IS ISTATISTIKLERI (JOB STATISTICS) ========================
Job ID: 
Cluster: 
User/Group: 
State: CANCELLED (exit code 0)
Nodes: 1
Cores per node: 128
CPU Utilized: 00:28:28
CPU Efficiency: 0.83% of 2-09:18:56 core-walltime
Job Wall-clock time: 00:26:52
Memory Utilized: 9.24 GB
Memory Efficiency: 3.79% of 244.14 GB

=======================================================================

There is a Slurm feature called heterogeneous jobs, which seems to be the modern and correct way of simultaneously starting job components with different requirements.

I will be looking into this, as it may simplify our documentation on SLURM sessions. It also opens the door to simpler scheduling on GPU and fat nodes, for example.
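
An untested sketch of what a heterogeneous job could look like for your two-participant setup (the hetjob separator and --het-group flags are standard Slurm, but the details may differ on your system):

#!/bin/bash
#SBATCH -A project
#SBATCH -p queue
#SBATCH -N 1 --ntasks=1 --cpus-per-task=128   # component 0: CalculiX, 1 task with many threads
#SBATCH hetjob
#SBATCH -N 1 --ntasks=128                     # component 1: OpenFOAM, 128 MPI ranks

# load modules / set the environment as in your existing scripts ...

(
  cd solid
  export OMP_NUM_THREADS=128
  export CCX_NPROC_EQUATION_SOLVER=128
  srun --het-group=0 ccx_preCICE -i solid -precice-participant Solid > log.ccx 2>&1
) &

(
  cd fluid
  srun --het-group=1 buoyantSimpleFoam -parallel > log.buoyantSimpleFoam 2>&1
) &

wait

If your OpenMPI is not built with Slurm/PMIx support, launching OpenFOAM with mpirun and a hostfile restricted to the second component should also work.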

The "Host key verification failed" error strongly suggests a problem with the setup of the cluster. I highly recommend reporting this to your system admins. You could try to remove ~/.ssh/known_hosts, but I doubt this will be sufficient in this case.

The most important thing is to check whether CalculiX uses the correct number of threads. Your provided SPOOLES run, for example, uses 20 threads, as indicated by the following line in the output:

 [host] OMP_NUM_THREADS='20'

It is generally interesting that the first OpenFOAM time window is slow and the subsequent time windows are fast.
Meaning, there are 3 phases to your simulation:

  1. participant initialization/startup time, which should stay pretty much the same for this setup
  2. first time window, which should get worse with OF problem size
  3. later time windows, which should remain stable as CalculiX is the bottleneck

I’ll have a look at 3., which is the most important in long-running simulations.

Spooles case

Here you can clearly see that the simulation is bottlenecked by CalculiX.
The OpenFOAM solver spends the majority of its time waiting for data from CalculiX in advance/m2n.receiveData.

Pardiso case

This one looks much better. CalculiX is significantly faster here, but still the bottleneck.
Ideally, for your parallel coupling scheme, the solver.advance of both solvers should take the same time. Here CalculiX runs at around 500 ms and OpenFOAM at around 200 ms.

This is an interesting observation. I’ll keep this in mind.

This looks like some massive hangup though. Could be the file system. Not 100% sure if this helps, but you could try calling sync between your benchmarks.

Your script looks pretty much exactly like what we would expect. I may get around to testing this on the SuperMUC-NG. I’ll let you know if I can share some insights.

My interpretation is

  1. the case is still bottlenecked by CalculiX
  2. there may be some hang times due to filesystem latency, which can be challenging to debug (experience talking)
  3. inspecting the overall time includes start-up, initialization, and other wait times. These are generally negligible in real simulations that run for thousands of time steps; for a 10-time-step test they are not.
  4. inspecting the visual output of the trace files, or the ratio between solver.advance and advance per participant using precice-profiling analyze, may be a more effective way of assessing the efficiency of your simulation.

As CalculiX is your main bottleneck, it may be worth trying a few things (untested):

  • trying the GPU solver
  • upgrading your compilers and linker as far as possible. You are already missing out on 3 years of compiler optimizations, especially for newer architectures. Maybe even try Clang or newer Intel compilers.
  • trying to activate link-time optimization for the CalculiX sources.