[OpenFOAM] Unable to run SLURM partitioned simulation

Hello preCICErs,

I’ve been trying to adapt the partitioned-pipe tutorial to run as a SLURM job, but I’m facing issues.
[NOTE: Serial execution of the original version of the tutorial runs without error.]

I have a script, run.sh (1.6 KB), in the case directory that takes care of the node partitioning and launches the individual participants.

Participant run scripts:
run.fluid1.sh (351 Bytes)
run.fluid2.sh (351 Bytes)

preCICE config:
precice-config.xml (2.9 KB)

When I launch the simulation, Fluid1 stops with this message:

---[precice] ^[[0m Setting up master communication to coupling partner/s

… and Fluid2 stops with this message:

Host key verification failed.^M
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

I’ve followed the steps in "Help! The participants are not finding each other! - #2 by Makis" but still couldn’t resolve the problem.
As I am relatively new to preCICE, any help would be appreciated. Thanks!

The message

Host key verification failed

from the second solver sounds like a problem of tasks connecting to each other between nodes. Are you able to run other programs that do not use preCICE across several nodes?

Yes, I use OpenFOAM without preCICE and it runs fine across multiple nodes. For example, here is a job running over 6 nodes:

  JOBID PARTITION     NAME     USER ST       TIME       START_TIME         END_TIME  TIME_LEFT  NODES MIN_MEMOR PRIOR SCHEDNODES NODELIST(REASON)     SCHEDNODES
7932235 imb-resou r5.ADM.0 nkumar00  R 1-14:54:46 2022-09-12T19:19 2022-09-14T19:19    9:05:14      6       93G   443 (null) n[342-343,361-364]     (null)

I was able to run the partitioned pipe tutorial with an in-house solver using Slurm a few months ago. You say you tried the steps in the guide you linked, yet you still have ./ as the exchange directory in your precice-config.xml. According to your run.sh, your two solvers work in two different directories, $parentDir/fluid1-openfoam-pimplefoam and $parentDir/fluid2-openfoam-pimplefoam, so this is definitely not correct. The path for the exchange directory is relative to where the solvers run, so each solver is looking in its own directory and therefore not finding the other solver. Try putting ../ as the exchange directory. You should end up with a precice-run directory in $parentDir after running.

This might be your problem, but if it were I would expect both solvers to hang indefinitely rather than one raising an error. It could be a network interface problem, though I don’t know your cluster so can’t say what it might be. One possible difference is that I used separate sbatch invocations for each solver.
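To illustrate the exchange-directory point, here is a minimal sketch that resolves ./ and ../ from two solver directories. It assumes the directory names from the run.sh described above, placed under a temporary directory; resolve_exchange_dir is a hypothetical helper for illustration, not part of preCICE.

```shell
#!/bin/sh
# Hypothetical layout mirroring the thread: two solver directories under one parent.
parentDir=$(mktemp -d)
mkdir -p "$parentDir/fluid1-openfoam-pimplefoam" "$parentDir/fluid2-openfoam-pimplefoam"

# Resolve an exchange-directory value as seen from a solver's working directory.
resolve_exchange_dir() {
    # $1: solver working directory, $2: exchange-directory value from the config
    (cd "$1" && cd "$2" && pwd -P)
}

for exchange in ./ ../; do
    d1=$(resolve_exchange_dir "$parentDir/fluid1-openfoam-pimplefoam" "$exchange")
    d2=$(resolve_exchange_dir "$parentDir/fluid2-openfoam-pimplefoam" "$exchange")
    if [ "$d1" = "$d2" ]; then
        echo "exchange-directory=$exchange: both solvers use $d1"
    else
        echo "exchange-directory=$exchange: solvers use different directories"
    fi
done
```

With ./ the two solvers end up in different directories, while ../ resolves to the shared parent for both.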

@Ray_Scarr Thanks for your reply. Yes, you’re right that the path of the exchange directory should have been ../. I had misinterpreted the remark about relative paths in "Help! The participants are not finding each other! - #2 by Makis". However, I re-ran the simulation with the correction and the error still persists.

Secondly, I ran some more tests and it seems that the error has to do with the Open MPI library that I am using. I will give more details below as a continuation to @ajaust’s remark.


Context:
I am running preCICE in a spack environment as I had faced some issues previously during installation. As such, the Open MPI library is also provided by the environment.


Running a simple Open MPI tutorial as a partitioned job:
I tried to reproduce the error using a simple MPI Hello World script.

The instructions to set up the test and the Slurm job submission script can be found here: run.sh (1.6 KB).

The log files generated are as follows:

(1) log.runOne

Hello world from processor n345, rank 0 out of 2 processors
Hello world from processor n345, rank 1 out of 2 processors

(2) log.runTwo

Host key verification failed.^M
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

As we can see, the error mentioned in my first post is replicated. The first execution shows no error but the second execution fails.

I then made the following change in the slurm job submission script:

#- Load modules
module purge
#- Activate spack environment which also loads mpirun
# spack env activate foamPrecice
module load openmpi/4.0.2/gcc/7.3.0

Basically, this uses the MPI library provided as a module on the cluster instead of the one used earlier which was provided by the spack environment.

With this, the error disappears and I get the output for both MPI executions.

(1) log.runOne:

Hello world from processor n351, rank 0 out of 2 processors
Hello world from processor n351, rank 1 out of 2 processors

(2) log.runTwo:

Hello world from processor n352, rank 0 out of 2 processors
Hello world from processor n352, rank 1 out of 2 processors

So the problem really seems to be with the Open MPI library loaded with the spack environment.

So to answer @ajaust, I am able to execute jobs over multiple nodes, but for some reason it fails in this specific case. As I am not sure how the host key verification is performed differently in the two cases, I am out of ideas, apart from rebuilding the spack environment from scratch.
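A quick way to compare the two Open MPI installations might be to check which mpirun is first on PATH and whether ompi_info lists any Slurm components. This is only a sketch: report_mpi is a hypothetical helper, and the interpretation is an assumption on my side. As far as I understand, an Open MPI build without Slurm support falls back to ssh-based launching on remote nodes, which would be consistent with the "Host key verification failed" message, but I have not verified this on the cluster in question.

```shell
#!/bin/sh
# Report which MPI launcher is first on PATH and whether it lists Slurm components.
report_mpi() {
    if ! command -v mpirun >/dev/null 2>&1; then
        echo "mpirun: not found on PATH"
        return 0
    fi
    echo "mpirun: $(command -v mpirun)"
    if command -v ompi_info >/dev/null 2>&1; then
        # Open MPI lists its compiled-in components; entries mentioning
        # "slurm" suggest that Slurm support was built in.
        ompi_info | grep -i slurm || echo "no slurm components listed by ompi_info"
    fi
}

report_mpi
```

Running this once with the spack environment active and once with only the cluster module loaded should show whether the two builds differ in this respect.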


Running a preCICE tutorial as a partitioned job
I also tested the same on the nice C++ parallel solver dummies written by @ajaust.

I first ran the serial version using this job submission script:
run.serial.sh (438 Bytes). This ran as intended and there was no error reported in the log files.
This confirms that:
(1) preCICE is installed correctly.
(2) preCICE jobs can be executed from a single slurm job submission script.

In the next step, I tried to set the number of processors for both MPI executions as N=M=2 in the slurm job submission script: run.parallel.sh (1.4 KB).
This generates the following log files:
(1) log.runOne

DUMMY (0): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
DUMMY (1): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
preCICE:^[[0m This is preCICE version 2.3.0
preCICE:^[[0m Revision info: no-info [Git failed/Not a repository]
preCICE:^[[0m Configuration: Release (Debug and Trace log unavailable)
preCICE:^[[0m Configuring preCICE with configuration "../precice-config-parallel.xml"
preCICE:^[[0m I am participant "SolverOne"
preCICE:^[[0m Connecting Master to 1 Slaves
preCICE:^[[0m Setting up master communication to coupling partner/s
preCICE:^[[0m Connecting Slave #0 to Master
preCICE:^[[0m Setting up master communication to coupling partner/s

(2) log.runTwo

Host key verification failed.^M
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

The second run fails again.


In the next attempt, I tried to load the MPI module provided by the cluster after activating the spack environment by making the following change in the job submission script:

#- Load modules
module purge
spack env activate foamPrecice
module load openmpi/4.0.2/gcc/7.3.0

Interestingly, this time I didn’t get an error, but both simulations got stuck. Here are the log files:

(1) log.runOne

DUMMY (0): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
DUMMY (1): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
preCICE:^[[0m This is preCICE version 2.3.0
preCICE:^[[0m Revision info: no-info [Git failed/Not a repository]
preCICE:^[[0m Configuration: Release (Debug and Trace log unavailable)
preCICE:^[[0m Configuring preCICE with configuration "../precice-config-parallel.xml"
preCICE:^[[0m I am participant "SolverOne"
preCICE:^[[0m Connecting Master to 1 Slaves
preCICE:^[[0m Setting up master communication to coupling partner/s
preCICE:^[[0m Connecting Slave #0 to Master
preCICE:^[[0m Setting up master communication to coupling partner/s
preCICE:^[[0m Masters are connected
preCICE:^[[0m Setting up preliminary slaves communication to coupling partner/s
preCICE:^[[0m Prepare partition for mesh MeshOne
preCICE:^[[0m Gather mesh MeshOne
preCICE:^[[0m Send global mesh MeshOne
preCICE:^[[0m Setting up slaves communication to coupling partner/s
preCICE:^[[0m Masters are connected
preCICE:^[[0m Setting up preliminary slaves communication to coupling partner/s
preCICE:^[[0m Prepare partition for mesh MeshOne
preCICE:^[[0m Gather mesh MeshOne
preCICE:^[[0m Send global mesh MeshOne
preCICE:^[[0m Setting up slaves communication to coupling partner/s

(2) log.runTwo

DUMMY (0): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverTwo", and mesh name "MeshTwo".
DUMMY (1): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverTwo", and mesh name "MeshTwo".
preCICE:^[[0m This is preCICE version 2.3.0
preCICE:^[[0m Revision info: no-info [Git failed/Not a repository]
preCICE:^[[0m Configuration: Release (Debug and Trace log unavailable)
preCICE:^[[0m Configuring preCICE with configuration "../precice-config-parallel.xml"
preCICE:^[[0m I am participant "SolverTwo"
preCICE:^[[0m Connecting Master to 1 Slaves
preCICE:^[[0m Setting up master communication to coupling partner/s
preCICE:^[[0m Connecting Slave #0 to Master
preCICE:^[[0m Setting up master communication to coupling partner/s

I let it run for 30 minutes but it didn’t proceed further.


This is all I could test so far.

Great job debugging so far and nice to see that the parallel solver dummies are helpful. That at least eliminates OpenFOAM from the equation at the moment.

It could be that you have some MPI installation issues that may be two-fold. Here are some guesses:

  1. Host key verification failed: What you observe could be some misconfiguration of the OpenMPI compiled with Spack. Maybe it needs some SLURM options activated such that you can start two jobs from within one job script. Spack’s OpenMPI recipe has a schedulers option. There are also quite a few other options that might affect the behavior (e.g. PMI, PMIX). Maybe it would be worth asking your admins about that.
  2. preCICE connection fails for openmpi/4.0.2/gcc/7.3.0: There might be a problem with mixing different MPI versions/installations, e.g. preCICE was compiled with the Spack OpenMPI and now runs with the system’s OpenMPI. Similar problems have been observed when preCICE is used with OpenFOAM, since OpenFOAM sometimes installs its own MPI. When preCICE and OpenFOAM have been compiled with different MPI installations, one tends to get “weird” issues. There should also be some posts about that issue on the forum.
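One way to check the second guess (mixed MPI installations) is to look at which libmpi each binary resolves to at run time. A minimal sketch, assuming a Linux system with ldd; the helper name mpi_libs_of and the paths in the comments are placeholders, not taken from your setup:

```shell
#!/bin/sh
# List the MPI shared libraries a binary or library resolves to at run time.
mpi_libs_of() {
    # $1: path to an executable or shared library
    ldd "$1" 2>/dev/null | grep -i 'libmpi' || echo "no libmpi dependency found for $1"
}

# Hypothetical usage; the actual paths depend on the installation:
#   mpi_libs_of "$(command -v pimpleFoam)"
#   mpi_libs_of /path/to/libprecice.so
mpi_libs_of /bin/sh
```

If the solver and the preCICE library report libmpi files from different prefixes, that would point towards a mix-up. Note that ldd reflects the environment it runs in, so it should be run with the same modules and spack environment as the job.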

Some further ideas/remarks:

Tests with MPI Hello World script

Can you avoid the “Host key verification failed” error by starting both executables from one line? This could be something similar to

mpirun -n $N -hostfile hosts.runOne ./mpi_hello_world &> log.runOne : -n $M -hostfile hosts.runTwo ./mpi_hello_world &> log.runTwo 

I am not sure if the piping to files works like this when using the colon operator in the command.

Source: OpenMPI docs
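On the open question about redirection: the shell handles redirections before the launcher sees its arguments, so both redirections apply to the whole command line and the last one wins. The sketch below shows this with a stand-in function in place of mpirun (show_args is a hypothetical helper, and > file 2>&1 is used as the portable spelling of &>):

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for mpirun: prints each argument it receives on its own line.
show_args() { printf 'arg: %s\n' "$@"; }

# Same shape as the one-line command above, with redirections in the middle.
show_args -n 2 ./a > log.runOne 2>&1 : -n 2 ./b > log.runTwo 2>&1

# log.runOne is created but stays empty; all output lands in log.runTwo.
wc -c log.runOne log.runTwo
cat log.runTwo
```

So with the colon syntax, the output of both executables would most likely end up together in log.runTwo. If separate logs are needed, one option might be to wrap each executable in a small script that appends to its own log file; I have not tested that with Open MPI.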

Tests with parallel solver dummies

  • You could try to start the dummies with MPI’s colon operator if that worked for the MPI Hello World script.
  • For the last case (parallel solver dummies with openmpi/4.0.2/gcc/7.3.0), where the connection between the processes hangs, you could try to delete the precice-run/ directory and then restart the solvers. Just in case you did not try that yet.
  • What could be a problem here is that there might be a mixup of MPI versions now. Could you recompile preCICE using the OpenMPI version provided by the system and check whether this solves the problems?
  • More a note than a suggestion: I have worked on one system where the creation of the connection between coupling partners was super slow. That system also used SLURM and I usually had to wait 10-20 minutes before the connection between coupling partners was established. However, you have waited 30 minutes, so I am not sure whether that is the issue you are facing.

I will post again if I have some other ideas for the issue.

Thanks Alex for the suggestions.

Can you point me to some resources regarding these options? Is it something I need to take care of during preCICE installation?

I’ve been in contact with the HPC team as well. I’ll update if there is any positive development.

I will check the forum for more issues like this. However, in my build, both preCICE and OpenFOAM use the same MPI recipe as they are installed in the same spack environment.

In case anyone wants to replicate the behaviour, the steps to recreate the spack environment that I am using can be found at the top of this comment: "[OpenFOAM-6 adapter] Error during build: undefined symbols of dependencies - #5 by nish-ant".

Starting both executables from one line with the colon operator gives the same Host key verification failed error for runTwo.

I made sure that the precice-run/ directory is removed before the next run but the error still persists.

I can try to build a new spack environment where both preCICE and OpenFOAM are linked to the Open MPI library provided on the cluster. Again, I will update with the findings.

Yes, this is something you need to take care of yourself. The best way depends a lot on how you install preCICE in Spack. In the packages.yaml you can set default settings for packages to be built and also declare external packages like your existing OpenMPI installation.

You can also pass options to dependencies directly in the command line, i.e. something like

spack install precice ^openmpi +legacylaunchers schedulers=slurm

See, for example, the Spack tutorial. Note that I did not check whether the command here actually makes sense and would lead to a useful installation.
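For the packages.yaml route, an external OpenMPI entry could look roughly like the fragment written by the sketch below. Everything specific in it is an assumption: the prefix is a placeholder, the version and compiler are only read off the module name openmpi/4.0.2/gcc/7.3.0, and the externals layout is the one used by recent Spack releases (older releases use a different layout). Inside a Spack environment the same section would probably go under the spack: key of spack.yaml instead.

```shell
#!/bin/sh
# Write an example packages.yaml fragment that tells Spack to reuse an
# existing Open MPI instead of building its own.
out=$(mktemp)
cat > "$out" <<'EOF'
packages:
  openmpi:
    externals:
    - spec: openmpi@4.0.2%gcc@7.3.0
      prefix: /path/to/openmpi-4.0.2   # placeholder, not a real path
    buildable: false
EOF
echo "wrote example fragment to $out"
cat "$out"
```

Running module show openmpi/4.0.2/gcc/7.3.0 on the cluster should reveal the real installation prefix, and spack external find openmpi may be able to generate a similar entry automatically once the module is loaded.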

Got it. I will go through the documentation and reinstall preCICE.