[OpenFOAM] Unable to run SLURM partitioned simulation

@Ray_Scarr Thanks for your reply. Yes, you’re right that the path of the exchange directory should have been ../. I had misinterpreted the remark about relative paths in Help! The participants are not finding each other! - #2 by Makis. However, I re-ran the simulation with the correction, and the error still persists.

Secondly, I ran some more tests, and it seems that the error has to do with the Open MPI library I am using. I will give more details below, continuing from @ajaust’s remark.


Context:
I am running preCICE in a Spack environment, as I had faced some issues during the installation. As such, the Open MPI library is also provided by that environment.


Running a simple Open MPI tutorial as a partitioned job:
I tried to reproduce the error using a simple MPI Hello World script.

The instructions to set up the test and the SLURM job submission script can be found here: run.sh (1.6 KB).
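For context, the idea of the script is to request two nodes and launch the hello-world binary twice, once per node, from the same job. A rough sketch (node counts, hostfile handling, and binary names are placeholders; the attached run.sh is what I actually used):

#!/bin/bash
#SBATCH --job-name=mpi-hello
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2

#- Split the allocation into one hostfile per execution
scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts.all
head -n 1 hosts.all > hosts.one
tail -n 1 hosts.all > hosts.two

#- Two independent mpirun executions, each pinned to its own node
mpirun -np 2 --hostfile hosts.one ./mpi_hello_world > log.runOne 2>&1 &
mpirun -np 2 --hostfile hosts.two ./mpi_hello_world > log.runTwo 2>&1 &
wait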

The log files generated are as follows:

(1) log.runOne

Hello world from processor n345, rank 0 out of 2 processors
Hello world from processor n345, rank 1 out of 2 processors

(2) log.runTwo

Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

As we can see, the error mentioned in my first post is reproduced: the first execution shows no error, but the second execution fails.

I then made the following change in the SLURM job submission script:

#- Load modules
module purge
#- Activate spack environment which also loads mpirun
# spack env activate foamPrecice
module load openmpi/4.0.2/gcc/7.3.0

Basically, this uses the MPI library provided as a module on the cluster, instead of the one provided by the Spack environment that was used earlier.
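To be sure which mpirun actually ends up on the PATH after this change, a quick check with standard commands is:

which mpirun        #- should now point to the cluster-wide module installation
mpirun --version    #- should report Open MPI 4.0.2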

With this, the error disappears and I get the output for both MPI executions.

(1) log.runOne:

Hello world from processor n351, rank 0 out of 2 processors
Hello world from processor n351, rank 1 out of 2 processors

(2) log.runTwo:

Hello world from processor n352, rank 0 out of 2 processors
Hello world from processor n352, rank 1 out of 2 processors

So the problem really seems to be with the Open MPI library loaded from the Spack environment.

So, to answer @ajaust: I am able to execute jobs over multiple nodes in general, but for some reason it fails in this specific case. As I am not sure why the host key verification behaves differently in the two cases, I am out of ideas, apart from rebuilding the Spack environment from scratch.
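If it helps the discussion: the "Host key verification failed." message makes me suspect that the Spack-built Open MPI tries to start its remote daemons via ssh, while the cluster module presumably uses SLURM's own launcher. A rough way to compare the two installations (sketch only; these are standard Open MPI commands, and foamPrecice is the name of my environment):

#- With the Spack environment active
spack env activate foamPrecice
ompi_info | grep -i -e plm -e slurm                 #- which process-launch components are available
mpirun --mca plm_base_verbose 10 -np 2 hostname     #- verbose output of how daemons are started

#- With the cluster module instead
module purge
module load openmpi/4.0.2/gcc/7.3.0
ompi_info | grep -i -e plm -e slurm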


Running a preCICE tutorial as a partitioned job
I also ran the same test with the nice C++ parallel solver dummies written by @ajaust.

I first ran the serial version using this job submission script:
run.serial.sh (438 Bytes). This ran as intended, and no error was reported in the log files.
This confirms that:
(1) preCICE is installed correctly.
(2) preCICE jobs can be launched from a single SLURM job submission script.

Next, I set the number of processes for both MPI executions to N = M = 2 in the SLURM job submission script: run.parallel.sh (1.4 KB).
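The launch pattern is essentially the same as in the hello-world test above, just with the solver dummies and the preCICE configuration (sketch only; paths and executable names are placeholders, the attached run.parallel.sh is what I actually used):

mpirun -np 2 --hostfile hosts.one ./solverdummy ../precice-config-parallel.xml SolverOne MeshOne > log.runOne 2>&1 &
mpirun -np 2 --hostfile hosts.two ./solverdummy ../precice-config-parallel.xml SolverTwo MeshTwo > log.runTwo 2>&1 &
wait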
This generates the following log files:
(1) log.runOne

DUMMY (0): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
DUMMY (1): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
preCICE: This is preCICE version 2.3.0
preCICE: Revision info: no-info [Git failed/Not a repository]
preCICE: Configuration: Release (Debug and Trace log unavailable)
preCICE: Configuring preCICE with configuration "../precice-config-parallel.xml"
preCICE: I am participant "SolverOne"
preCICE: Connecting Master to 1 Slaves
preCICE: Setting up master communication to coupling partner/s
preCICE: Connecting Slave #0 to Master
preCICE: Setting up master communication to coupling partner/s

(2) log.runTwo

Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

The second run fails again.


In the next attempt, I loaded the MPI module provided by the cluster after activating the Spack environment, by making the following change in the job submission script:

#- Load modules
module purge
spack env activate foamPrecice
module load openmpi/4.0.2/gcc/7.3.0

Interestingly, this time I didn’t get an error, but both simulations got stuck. Here are the log files:

(1) log.runOne

DUMMY (0): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
DUMMY (1): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
preCICE: This is preCICE version 2.3.0
preCICE: Revision info: no-info [Git failed/Not a repository]
preCICE: Configuration: Release (Debug and Trace log unavailable)
preCICE: Configuring preCICE with configuration "../precice-config-parallel.xml"
preCICE: I am participant "SolverOne"
preCICE: Connecting Master to 1 Slaves
preCICE: Setting up master communication to coupling partner/s
preCICE: Connecting Slave #0 to Master
preCICE: Setting up master communication to coupling partner/s
preCICE: Masters are connected
preCICE: Setting up preliminary slaves communication to coupling partner/s
preCICE: Prepare partition for mesh MeshOne
preCICE: Gather mesh MeshOne
preCICE: Send global mesh MeshOne
preCICE: Setting up slaves communication to coupling partner/s
preCICE: Masters are connected
preCICE: Setting up preliminary slaves communication to coupling partner/s
preCICE: Prepare partition for mesh MeshOne
preCICE: Gather mesh MeshOne
preCICE: Send global mesh MeshOne
preCICE: Setting up slaves communication to coupling partner/s

(2) log.runTwo

DUMMY (0): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverTwo", and mesh name "MeshTwo".
DUMMY (1): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverTwo", and mesh name "MeshTwo".
preCICE: This is preCICE version 2.3.0
preCICE: Revision info: no-info [Git failed/Not a repository]
preCICE: Configuration: Release (Debug and Trace log unavailable)
preCICE: Configuring preCICE with configuration "../precice-config-parallel.xml"
preCICE: I am participant "SolverTwo"
preCICE: Connecting Master to 1 Slaves
preCICE: Setting up master communication to coupling partner/s
preCICE: Connecting Slave #0 to Master
preCICE: Setting up master communication to coupling partner/s

I let it run for 30 minutes, but it didn’t proceed any further.
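One possible explanation I still need to rule out in this last setup is a mismatch between the MPI that preCICE and the solver dummies were compiled against (the Spack Open MPI) and the mpirun from the cluster module that now launches them. A quick check would be (sketch; the binary path is a placeholder):

ldd ./solverdummy | grep -i mpi    #- which libmpi the binary actually loads at runtime
which mpirun && mpirun --version   #- which mpirun is used to launch it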


This is all I could test so far.