[OpenFOAM] Unable to run SLURM partitioned simulation

@Ray_Scarr Thanks for your reply. Yes, you’re right that the path of the exchange directory should have been ../. I had misinterpreted the remark about relative paths in “Help! The participants are not finding each other!” (post #2 by Makis). However, I re-ran the simulation with the correction and the error persists.

Secondly, I ran some more tests and it seems that the error has to do with the Open MPI library that I am using. I will give more details below as a continuation to @ajaust’s remark.

I am running preCICE in a spack environment as I had faced some issues previously during installation. As such, the Open MPI library is also provided by the environment.

Running a simple Open MPI tutorial as a partitioned job:
I tried to reproduce the error using a simple MPI Hello World script.

The instructions to set up the test and the SLURM job submission script can be found here: run.sh (1.6 KB).
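For readers without the attachment, a partitioned job of this kind could be sketched roughly as follows. This is not the content of run.sh: the helper name split_hosts, the file names hosts.one and hosts.two, and the executable name ./mpi_hello_world are placeholders chosen for illustration. The idea is only that one SLURM allocation is split into two host lists and one mpirun is started per list.

```shell
# Sketch only: split the nodes of one SLURM allocation into two hostfiles.
# Reads one hostname per line on stdin, writes the first half (rounded up)
# to the file named by $1 and the rest to the file named by $2.
split_hosts() {
    awk -v first="$1" -v second="$2" '
        NF { hosts[++n] = $1 }
        END {
            half = int((n + 1) / 2)
            printf "" > first      # make sure both files exist, even if empty
            printf "" > second
            for (i = 1; i <= n; i++)
                print hosts[i] > (i <= half ? first : second)
        }'
}

# Only meaningful inside a SLURM job; skipped elsewhere.
if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    scontrol show hostnames "$SLURM_JOB_NODELIST" | split_hosts hosts.one hosts.two
    # One MPI execution per host list, both running at the same time.
    mpirun -np 2 --hostfile hosts.one ./mpi_hello_world > log.runOne 2>&1 &
    mpirun -np 2 --hostfile hosts.two ./mpi_hello_world > log.runTwo 2>&1 &
    wait
fi
```

With two nodes in the allocation, each execution ends up on its own node, which matches the one-node-per-log pattern in the output below.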

The log files generated are as follows:

(1) log.runOne

Hello world from processor n345, rank 0 out of 2 processors
Hello world from processor n345, rank 1 out of 2 processors

(2) log.runTwo

Host key verification failed.
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).

As we can see, the error from my first post is reproduced: the first execution runs without error, but the second one fails.
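One possible reading of "Host key verification failed", which is not confirmed anywhere in this thread, is that mpirun tries to start its daemon on the second node over ssh instead of through SLURM. For Open MPI 4.x, whether a build contains SLURM components can be checked from the output of ompi_info. The sketch below assumes that output format; has_slurm_support is a hypothetical helper name.

```shell
# Reads `ompi_info` output on stdin. Succeeds if a SLURM launcher (plm)
# or SLURM allocation reader (ras) component is listed.
has_slurm_support() {
    grep -Eq 'MCA (plm|ras): slurm'
}

# Run the check only if ompi_info is available in the current environment.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_slurm_support; then
        echo "Open MPI reports SLURM components"
    else
        echo "No SLURM components listed; mpirun probably falls back to rsh/ssh"
    fi
fi
```

Running this once per Open MPI installation available on the cluster and comparing the results would show whether the builds differ in this respect. If a Spack-built Open MPI turns out to lack SLURM components, the Spack openmpi package has a schedulers variant (for example schedulers=slurm) that may be worth looking at; whether that resolves this particular failure is untested here.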

I then made the following change in the slurm job submission script:

#- Load modules
module purge
#- Activate spack environment which also loads mpirun
# spack env activate foamPrecice
module load openmpi/4.0.2/gcc/7.3.0

In other words, this uses the Open MPI module provided by the cluster instead of the Open MPI from the spack environment that was used earlier.

With this, the error disappears and I get the output for both MPI executions.

(1) log.runOne:

Hello world from processor n351, rank 0 out of 2 processors
Hello world from processor n351, rank 1 out of 2 processors

(2) log.runTwo:

Hello world from processor n352, rank 0 out of 2 processors
Hello world from processor n352, rank 1 out of 2 processors

So the problem really seems to be with the Open MPI library loaded with the spack environment.
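To make the difference between the two setups visible in the job log, a few lines like the following could be placed right after the module or spack lines of the job script. This is a sketch: report_tool is a hypothetical helper, and the list of commands is only a suggestion (orted is the daemon named in the error message above).

```shell
# Print where a command resolves from on the current PATH,
# or state that it cannot be found.
report_tool() {
    resolved=$(command -v "$1" 2>/dev/null) || resolved="not found"
    echo "$1: $resolved"
}

# Record which Open MPI installation the job actually picks up.
report_tool mpirun
report_tool ompi_info
report_tool orted
```

If the paths differ between the failing and the working job, that at least confirms which installation each run used.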

So to answer @ajaust: I am able to execute jobs over multiple nodes, but for some reason it fails in this specific case. Since I do not know how host key verification differs between the two cases, I am out of ideas, apart from rebuilding the spack environment from scratch.

Running a preCICE tutorial as a partitioned job
I also tested the same on the nice C++ parallel solver dummies written by @ajaust.

I first ran the serial version using this job submission script:
run.serial.sh (438 Bytes). This ran as intended and no error was reported in the log files.
This confirms that:
(1) preCICE is installed correctly.
(2) preCICE jobs can be executed from a single SLURM job submission script.

In the next step, I set the number of processes for both MPI executions to N=M=2 in the SLURM job submission script: run.parallel.sh (1.4 KB).
This generates the following log files:
(1) log.runOne

DUMMY (0): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
DUMMY (1): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
preCICE: This is preCICE version 2.3.0
preCICE: Revision info: no-info [Git failed/Not a repository]
preCICE: Configuration: Release (Debug and Trace log unavailable)
preCICE: Configuring preCICE with configuration "../precice-config-parallel.xml"
preCICE: I am participant "SolverOne"
preCICE: Connecting Master to 1 Slaves
preCICE: Setting up master communication to coupling partner/s
preCICE: Connecting Slave #0 to Master
preCICE: Setting up master communication to coupling partner/s

(2) log.runTwo

Host key verification failed.
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).

The second run fails again.

In the next attempt, I tried to load the MPI module provided by the cluster after activating the spack environment by making the following change in the job submission script:

#- Load modules
module purge
spack env activate foamPrecice
module load openmpi/4.0.2/gcc/7.3.0

Interestingly, this time I didn’t get an error, but both simulations got stuck. Here are the log files:

(1) log.runOne

DUMMY (0): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
DUMMY (1): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
preCICE: This is preCICE version 2.3.0
preCICE: Revision info: no-info [Git failed/Not a repository]
preCICE: Configuration: Release (Debug and Trace log unavailable)
preCICE: Configuring preCICE with configuration "../precice-config-parallel.xml"
preCICE: I am participant "SolverOne"
preCICE: Connecting Master to 1 Slaves
preCICE: Setting up master communication to coupling partner/s
preCICE: Connecting Slave #0 to Master
preCICE: Setting up master communication to coupling partner/s
preCICE: Masters are connected
preCICE: Setting up preliminary slaves communication to coupling partner/s
preCICE: Prepare partition for mesh MeshOne
preCICE: Gather mesh MeshOne
preCICE: Send global mesh MeshOne
preCICE: Setting up slaves communication to coupling partner/s
preCICE: Masters are connected
preCICE: Setting up preliminary slaves communication to coupling partner/s
preCICE: Prepare partition for mesh MeshOne
preCICE: Gather mesh MeshOne
preCICE: Send global mesh MeshOne
preCICE: Setting up slaves communication to coupling partner/s

(2) log.runTwo

DUMMY (0): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverTwo", and mesh name "MeshTwo".
DUMMY (1): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverTwo", and mesh name "MeshTwo".
preCICE: This is preCICE version 2.3.0
preCICE: Revision info: no-info [Git failed/Not a repository]
preCICE: Configuration: Release (Debug and Trace log unavailable)
preCICE: Configuring preCICE with configuration "../precice-config-parallel.xml"
preCICE: I am participant "SolverTwo"
preCICE: Connecting Master to 1 Slaves
preCICE: Setting up master communication to coupling partner/s
preCICE: Connecting Slave #0 to Master
preCICE: Setting up master communication to coupling partner/s

I let it run for 30 minutes but it didn’t proceed further.
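In this last attempt the solver dummies and preCICE were presumably built against the Open MPI from the spack environment, while mpirun came from the cluster module. Whether that mix is behind the hang is only a guess, but it can at least be checked which libmpi the executable resolves at run time. In the sketch below, libmpi_path is a hypothetical helper and ./solverdummy is a placeholder for the actual executable path.

```shell
# Reads `ldd` output on stdin and prints the resolved location of libmpi,
# or "not found" if the dynamic linker cannot resolve it.
libmpi_path() {
    awk '$1 ~ /^libmpi\.so/ && $2 == "=>" { print ($3 == "not" ? "not found" : $3) }'
}

solver=./solverdummy   # placeholder: adjust to the real executable

# Compare the MPI library the solver loads with the mpirun on PATH.
if [ -x "$solver" ] && command -v ldd >/dev/null 2>&1; then
    echo "solver loads libmpi from: $(ldd "$solver" | libmpi_path)"
    echo "mpirun resolves to:       $(command -v mpirun)"
fi
```

If the two paths point into different installations, the job is mixing an mpirun from one Open MPI with libraries from another, which would be worth ruling out before looking further.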

This is all I could test so far.