I am setting up preCICE on a cluster, and in some configurations I get the following error:
[com::MPIDirectCommunication]:118 in getLeaderRank: ERROR: Unknown accessor name "BiotSolverMaster"!
and I have no idea where this comes from.
Some quick facts:
My simulation has two participants named `BiotSolver` and `HDFlowSolver`.
Either of the two solvers might show the error message.
Both solvers are implemented in FEniCS; therefore, I use the preCICE Python bindings.
I am still on preCICE 1.6.1 (due to compatibility reasons).
Most of the software (including preCICE) has been installed via Spack, but not FEniCS.
The simulation is started via SLURM.
I use MPICH 3.3.1 with GCC 9.2.0.
This problem seems to get worse the more nodes I use: 2 nodes seem to be fine, but 3 nodes are already a problem.
Sometimes the simulation seems to hang at the beginning, similar to what is described in the documentation. The solutions proposed there do not seem to solve my problem.
Using srun instead of mpirun makes the `getLeaderRank: ERROR: Unknown accessor name` error disappear.
The hanging of the simulation at the start seems to happen while preCICE is distributing the meshes and doing computations on them (?!). I will monitor this further and report back.
Your job script indeed looks desperate.
What is your preCICE config? Do you use MPI or TCP/IP for the m2n communication? If TCP/IP, which network do you use?
What is missing in your job script is pinning your MPI jobs to nodes, as done here.
In the past we also had problems with participants started on the same node.
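For illustration, a pinned launch could look roughly like the sketch below. Node names, task counts, and solver script names are placeholders; adapt them to your cluster and allocation.

```bash
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks=24
#SBATCH --nodelist=node[01-04]

# Pin each participant to its own, disjoint set of nodes
# (node names and solver scripts are placeholders).
# Depending on how MPICH and Slurm were built, you may also need srun --mpi=pmi2.
srun --nodes=2 --ntasks=12 --nodelist=node01,node02 python3 biot_solver.py &
srun --nodes=2 --ntasks=12 --nodelist=node03,node04 python3 hdflow_solver.py &
wait
```

This way the two participants cannot end up interleaved on the same nodes.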
Thanks! At the moment I can say that starting the job from my job script with `time srun --mpi=pmi2 -n 12` made the situation a bit more stable, but the simulation still tends to hang in the beginning, where both solvers just show `[impl::SolverInterfaceImpl]:232 in initialize: Setting up slaves communication to coupling partner/s`. It seems to be a bit random though, so I cannot reproduce it properly at the moment.
I use TCP/IP (sockets) for the communication. I have tried to set the communication interface explicitly to eth0, but I am not sure if this changes much. I did this just now and restarted a setup that would otherwise hang during initialization, and it runs now. At the moment, I am not sure whether this is actually related or not; a lot of the behavior seems unpredictable right now.
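For reference, the m2n part of my preCICE configuration now looks roughly like this (the exchange directory is just a placeholder here):

```xml
<m2n:sockets from="BiotSolver" to="HDFlowSolver"
             network="eth0"
             exchange-directory="/path/to/shared/run-directory" />
```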
Would you recommend trying the MPI communication interface? My libraries and solvers are built on top of MPICH, so the support should be better than for OpenMPI.
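If I read the configuration reference correctly, that would presumably just mean replacing the sockets tag with something like the following (untested on my side):

```xml
<m2n:mpi from="BiotSolver" to="HDFlowSolver" />
```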
I did not search for the term “Slurm”, which would have led me to the page that you have linked. However, even then I am not sure whether I would have followed the job script on that page, as it is not obvious whether it is so elaborate due to Slurm or due to the configuration of that particular supercomputer. Maybe we should add some of this information to the network troubleshooting page, or add a new page to the wiki that collects links to and/or information about running preCICE on clusters.
Is it a problem of Slurm or of preCICE when participants share nodes? I tried to at least get my processes grouped by Slurm, so that there would be minimal sharing, using `-m block`, but Slurm does not seem to care about that.
No InfiniBand, but a bit faster than a standard network. It is an internal cluster of the institute. I want to get some experience with my code + preCICE and MPI before I go to the big machine.
I will see how it goes when specifying the network interface explicitly. The cluster here is also currently undergoing heavy maintenance.
From the preCICE documentation

> For certain systems, you need to specify the network over which the TCP/IP sockets get connected: `network="..."`. It defaults to `"lo"`.

it was not really clear to me that I must specify the network interface in the preCICE configuration.
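In case it helps others reading this: to see which interface names are available on a compute node, I just check the output of, e.g.:

```bash
# List network interfaces and their addresses on a node
ip -brief addr
```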
I guess it might be (the lack of) pinning or a configuration issue. It is a small system with only a few (16) identical nodes.
At least this problem is solved with Hazel Hen being disassembled.