I’m extremely delighted and grateful for the guidance I’ve received here, in https://precice.discourse.group/t/does-the-executable-binary-file-of-calculix-support-running-on-a-slurm-cluster-via-mpi/2336/6, and on the CCX forum: https://calculix.discourse.group/t/can-calculix-run-across-multiple-nodes/1316/6. Thank you all for your help. Below, I’ve summarized my situation and outlined my next steps.
Goal
I aim to transition from running my simulation on a 6-core CPU on my PC to an HPC system with 128-core nodes, targeting at least a 10x speed-up (using 20+ times more cores). However, so far, I’ve only achieved a 1x to 4x speed-up.
Case Details
I’m running a steady-state Conjugate Heat Transfer (CHT) case with radiation, involving one fluid and one solid participant. The coupling is handled using parallel-implicit mode with the same preCICE configuration as in the heat-exchanger tutorial.
For radiation modeling, I have two options (the coupling-scheme settings that control these iteration counts are sketched after this list):
- fvDOM in OpenFOAM: after 4–5 time steps, the number of coupling iterations per time step drops to 1 (almost like explicit coupling).
- Cavity radiation in CCX: requires ~10 coupling iterations per time step but provides more reliable results.
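For context, in an implicit coupling scheme the number of iterations per time window is driven by the convergence measures and capped by max-iterations. A minimal sketch of the relevant part of precice-config.xml, with placeholder data and mesh names (not necessarily those of my actual configuration):

```xml
<coupling-scheme:parallel-implicit>
  <participants first="Fluid" second="Solid" />
  <!-- Iterate each time window until the convergence measures are met,
       or stop after max-iterations. -->
  <max-iterations value="30" />
  <relative-convergence-measure limit="1e-5" data="Temperature" mesh="Solid-Mesh" />
  <!-- time-window-size, data exchanges, and acceleration omitted for brevity -->
</coupling-scheme:parallel-implicit>
```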
Performance on HPC
- On my PC (6 cores), the case runs successfully.
- On HPC (128-core nodes), I expected a 10x speed-up when using 1–2 nodes (20–40x more cores).
- However, results show:
  - fvDOM in OpenFOAM: ~4x speed-up.
  - Cavity radiation in CCX: <2x speed-up, despite a 20x increase in core count.
From the ExecutionTime entries in the OpenFOAM log, I see that OpenFOAM itself scales well (tested with up to 100 cores). However, the overall simulation time does not decrease significantly, which points to CCX or the coupling as the bottleneck.
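For reference, this is how I read off the timing, assuming the standard OpenFOAM log output (the log file name is just an example):

```bash
# Last reported ExecutionTime (CPU time) and ClockTime (wall-clock time) of a run.
grep "ExecutionTime" log.buoyantSimpleFoam | tail -n 1
```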
My assumptions
- If CCX is correctly configured (with Spooles, Pardiso, or PaStiX and a proper Slurm script), it should scale reasonably well up to ~100 cores on a single node using OpenMP, rather than just 4–8 cores (see the thread-count sketch after this list).
- If this is true, the issue could be:
  - a bad Slurm script, or
  - the need to switch solvers (from Spooles to Pardiso/PaStiX).
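Regarding the thread count: as far as I understand, the multithreaded CCX solvers read it from environment variables, so the Slurm script has to export them explicitly. A minimal sketch, assuming a CCX build with OpenMP support; the input deck and participant names are placeholders:

```bash
# Thread counts for CCX: as far as I know, CCX_NPROC_EQUATION_SOLVER is read by
# the multithreaded equation solvers and OMP_NUM_THREADS by the OpenMP parts.
export CCX_NPROC_EQUATION_SOLVER=16
export OMP_NUM_THREADS=16

# Input deck ("solid.inp") and participant name are placeholders.
ccx_preCICE -i solid -precice-participant Solid
```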
Next Steps
- Enable deeper profiling by adding lines to precice-config.xml, as @fsimonis suggested, to track communication and CCX execution time (see the profiling sketch after this list).
- Fix the Slurm script: run CCX on one node and OpenFOAM on another to avoid synchronization issues (and, hopefully, the problem where the simulation gets stuck with the participants waiting for each other). A possible layout is sketched after this list.
- Install PaStiX (Spack installation available; see the sketch after this list).
- Install Pardiso.
- Test different CPU allocations and solvers (Spooles, Pardiso, PaStiX) on the HPC and compare performance results.
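For the profiling step, my understanding is that preCICE v3 can record detailed events directly from the configuration; a minimal sketch (to be checked against the options @fsimonis pointed to), whose output can, as far as I understand, be post-processed with the precice-profiling tool:

```xml
<precice-configuration>
  <!-- Record detailed profiling events on all ranks. -->
  <profiling mode="all" />
  <!-- ... rest of the existing configuration ... -->
</precice-configuration>
```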
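For the Slurm script, the rough layout I have in mind is a single job that places the two participants on separate nodes and starts them side by side. This is only a sketch under several assumptions: solver name, case directories, input deck, core counts, and walltime are placeholders, and the OpenFOAM task count has to match the domain decomposition:

```bash
#!/bin/bash
#SBATCH --job-name=cht-coupled
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --time=24:00:00

# Participant 1: OpenFOAM, MPI-parallel on one node
# (task count must match the decomposePar decomposition).
(
  cd fluid-openfoam
  srun --nodes=1 --ntasks=100 buoyantSimpleFoam -parallel > log.buoyantSimpleFoam 2>&1
) &

# Participant 2: CCX, OpenMP-parallel on the other node.
(
  cd solid-calculix
  export CCX_NPROC_EQUATION_SOLVER=64
  export OMP_NUM_THREADS=64
  srun --nodes=1 --ntasks=1 --cpus-per-task=64 \
       ccx_preCICE -i solid -precice-participant Solid > log.ccx 2>&1
) &

# Depending on the Slurm version, concurrent job steps may need extra srun
# options (e.g. --exact) so that the two steps do not block each other.

# Wait for both participants to finish.
wait
```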
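For the PaStiX installation, my current plan is simply the following (assuming the Spack package is named pastix; I still need to check whether CCX requires its own adapted PaStiX build instead):

```bash
# Install and load PaStiX via Spack (package name assumed).
spack install pastix
spack load pastix
```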
I have limited experience with HPC installations, Slurm scripts, and hostfiles, and I also have other responsibilities, so progress might be slow. However, I will share my findings here as I move forward.
Meanwhile, if anyone with experience in CCX on HPC has additional insight to share, I would greatly appreciate it.
Kind regards,
Umut