@Ray_Scarr Thanks for your reply. Yes, you're right that the path of the exchange directory should have been ../. I had misinterpreted the remark about relative paths in Help! The participants are not finding each other! - #2 by Makis. However, I re-ran the simulation with the correction and the error still persists.
Secondly, I ran some more tests, and it seems that the error has to do with the Open MPI library I am using. I will give more details below, following up on @ajaust's remark.
Context:
I am running preCICE in a spack environment, as I had previously faced some issues during installation. As such, the Open MPI library is also provided by that environment.
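For reference, one can check which Open MPI the spack environment actually provides; a quick sketch (the environment name foamPrecice is the one used in my job scripts below, and spack's output format may differ between versions):

spack env activate foamPrecice
#- List the Open MPI spec installed in the environment
spack find openmpi
#- mpirun should now resolve into the spack installation tree
which mpirun
mpirun --version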
Running a simple Open MPI tutorial as a partitioned job:
I tried to reproduce the error using a simple MPI Hello World script.
The instructions to set up the test and the slurm job submission script can be found here: run.sh (1.6 KB).
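For context, the script follows the usual pattern for partitioned jobs: one Slurm allocation whose node list is split between two mpirun calls. A minimal sketch of that structure (not the exact run.sh; the node counts, paths, and the binary name mpi_hello are placeholders):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2

#- Split the allocated nodes between the two executions
NODES=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))

#- Run each execution on its own node, 2 ranks each
mpirun -np 2 --host "${NODES[0]}:2" ./mpi_hello > log.runOne 2>&1 &
mpirun -np 2 --host "${NODES[1]}:2" ./mpi_hello > log.runTwo 2>&1 &
wait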
The log files generated are as follows:
(1) log.runOne
Hello world from processor n345, rank 0 out of 2 processors
Hello world from processor n345, rank 1 out of 2 processors
(2) log.runTwo
Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
As we can see, this replicates the error mentioned in my first post: the first execution runs without errors, but the second one fails.
I then made the following change in the slurm job submission script:
#- Load modules
module purge
#- Activate spack environment which also loads mpirun
# spack env activate foamPrecice
module load openmpi/4.0.2/gcc/7.3.0
Basically, this uses the MPI library provided as a module on the cluster instead of the one from the spack environment that was used earlier.
With this change, the error disappears and I get the expected output from both MPI executions.
(1) log.runOne:
Hello world from processor n351, rank 0 out of 2 processors
Hello world from processor n351, rank 1 out of 2 processors
(2) log.runTwo:
Hello world from processor n352, rank 0 out of 2 processors
Hello world from processor n352, rank 1 out of 2 processors
So the problem really seems to lie with the Open MPI library loaded via the spack environment.
To answer @ajaust: I am able to execute jobs over multiple nodes, but for some reason it fails in this specific case. As I am not sure how the host key verification is performed differently in the two cases, I am out of ideas, apart from rebuilding the spack environment from scratch.
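One way to narrow this down: the message "Host key verification failed." comes from ssh, which suggests that the spack-provided mpirun launches its remote daemons over ssh (ORTE's rsh/ssh launcher) instead of through Slurm. Assuming that, a rough diagnostic sketch would be:

#- Which process launchers does this Open MPI build support (slurm vs. rsh/ssh)?
ompi_info | grep -i plm
#- Does non-interactive ssh between compute nodes work without a host key prompt?
ssh -o BatchMode=yes <other-node> hostname

If the spack build only lists the rsh component while the cluster module also lists slurm, that would explain why only the spack-provided mpirun runs into host key verification.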
Running a preCICE tutorial as a partitioned job:
I also ran the same test with the nice parallel C++ solver dummies written by @ajaust.
I first ran the serial version using this job submission script: run.serial.sh (438 Bytes). This ran as intended, and no errors were reported in the log files.
This confirms that:
(1) preCICE is installed correctly.
(2) preCICE jobs can be launched from a single slurm job submission script.
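For reference, the core of the serial script is just the two participants started from the same allocation and left to find each other via preCICE; a sketch (the binary name and config file name are assumptions based on the parallel logs below):

#- Start both participants; they connect to each other through preCICE
./solverdummy ../precice-config.xml SolverOne MeshOne > log.runOne 2>&1 &
./solverdummy ../precice-config.xml SolverTwo MeshTwo > log.runTwo 2>&1 &
wait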
In the next step, I set the number of processors for both MPI executions to N = M = 2 in the slurm job submission script: run.parallel.sh (1.4 KB).
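Based on the logs below, the relevant part of that script should be the same two launches, now going through mpirun on separate nodes (again a sketch, with the node splitting as in the hello-world script above):

mpirun -np 2 --host "${NODES[0]}:2" ./solverdummy ../precice-config-parallel.xml SolverOne MeshOne > log.runOne 2>&1 &
mpirun -np 2 --host "${NODES[1]}:2" ./solverdummy ../precice-config-parallel.xml SolverTwo MeshTwo > log.runTwo 2>&1 &
wait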
Running this script generates the following log files:
(1) log.runOne
DUMMY (0): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
DUMMY (1): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
preCICE: This is preCICE version 2.3.0
preCICE: Revision info: no-info [Git failed/Not a repository]
preCICE: Configuration: Release (Debug and Trace log unavailable)
preCICE: Configuring preCICE with configuration "../precice-config-parallel.xml"
preCICE: I am participant "SolverOne"
preCICE: Connecting Master to 1 Slaves
preCICE: Setting up master communication to coupling partner/s
preCICE: Connecting Slave #0 to Master
preCICE: Setting up master communication to coupling partner/s
(2) log.runTwo
Host key verification failed.
(The same ORTE error block as in the hello-world test above follows.)
The second run fails again.
Next, I tried loading the MPI module provided by the cluster after activating the spack environment, by making the following change in the job submission script:
#- Load modules
module purge
spack env activate foamPrecice
module load openmpi/4.0.2/gcc/7.3.0
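Note that this mixes toolchains: preCICE and the solver dummies were built against the spack-provided Open MPI but are now launched with the cluster's mpirun, and such a mismatch between build-time and run-time MPI can plausibly cause exactly this kind of hang. A quick sanity check (binary name assumed):

#- Which mpirun is actually first on the PATH after both commands?
which mpirun
#- Which MPI shared libraries does the solver resolve at run time?
ldd ./solverdummy | grep -i mpi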
Interestingly, this time I didn't get an error, but both simulations got stuck. Here are the log files:
(1) log.runOne
DUMMY (0): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
DUMMY (1): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverOne", and mesh name "MeshOne".
preCICE: This is preCICE version 2.3.0
preCICE: Revision info: no-info [Git failed/Not a repository]
preCICE: Configuration: Release (Debug and Trace log unavailable)
preCICE: Configuring preCICE with configuration "../precice-config-parallel.xml"
preCICE: I am participant "SolverOne"
preCICE: Connecting Master to 1 Slaves
preCICE: Setting up master communication to coupling partner/s
preCICE: Connecting Slave #0 to Master
preCICE: Setting up master communication to coupling partner/s
preCICE: Masters are connected
preCICE: Setting up preliminary slaves communication to coupling partner/s
preCICE: Prepare partition for mesh MeshOne
preCICE: Gather mesh MeshOne
preCICE: Send global mesh MeshOne
preCICE: Setting up slaves communication to coupling partner/s
preCICE: Masters are connected
preCICE: Setting up preliminary slaves communication to coupling partner/s
preCICE: Prepare partition for mesh MeshOne
preCICE: Gather mesh MeshOne
preCICE: Send global mesh MeshOne
preCICE: Setting up slaves communication to coupling partner/s
(2) log.runTwo
DUMMY (0): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverTwo", and mesh name "MeshTwo".
DUMMY (1): Running solver dummy with preCICE config file "../precice-config-parallel.xml", participant name "SolverTwo", and mesh name "MeshTwo".
preCICE: This is preCICE version 2.3.0
preCICE: Revision info: no-info [Git failed/Not a repository]
preCICE: Configuration: Release (Debug and Trace log unavailable)
preCICE: Configuring preCICE with configuration "../precice-config-parallel.xml"
preCICE: I am participant "SolverTwo"
preCICE: Connecting Master to 1 Slaves
preCICE: Setting up master communication to coupling partner/s
preCICE: Connecting Slave #0 to Master
preCICE: Setting up master communication to coupling partner/s
I let it run for 30 minutes, but it didn't proceed any further.
This is all I have been able to test so far.