getLeaderRank: ERROR: Unknown accessor name

Hi,

I am setting up preCICE on a cluster and in some configurations I get the following error

[com::MPIDirectCommunication]:118 in getLeaderRank: ERROR: Unknown accessor name "BiotSolverMaster"!

and I have no idea where this comes from.

Some quick facts:

  • My simulation has two participants named BiotSolver and HDFlowSolver.
  • Either of the two solvers might show the error message.
  • Both solvers are implemented in FEniCS, so I use the preCICE Python bindings.
  • I am still on preCICE 1.6.1 (due to compatibility reasons).
  • Most of the software (including preCICE) has been installed via Spack, but not FEniCS.
  • The simulation is started via SLURM.
  • I use mpich 3.3.1 with GCC 9.2.0.
  • This problem seems to get worse the more nodes I use: 2 nodes seem to be fine, but 3 nodes are already a problem.
  • Sometimes the simulation seems to hang at the beginning, similar to what is described in the documentation. The solutions proposed there do not seem to solve my problem.
  • These are the log files I get: biotsolver-18921.log (2.2 KB), hdflowsolver-18921.log (1.2 KB)
  • This is my SLURM batch script (1.9 KB). As you can see, I am getting desperate.

Could anyone point me in a useful direction for debugging this issue? Is this a network issue or rather an MPI problem?

Thanks in advance!

Best,
Alex

Some observations:

It seems that starting the executable inside the jobscript with

time srun --mpi=pmi2 -n 12 python3 biot_solver.py biot.input > biotsolver-${SLURM_JOBID}.log  2>&1  &

instead of mpirun makes the error getLeaderRank: ERROR: Unknown accessor name disappear.
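
For reference, the relevant part of the job script now looks roughly like the sketch below. The HDFlowSolver line is only sketched here; its script name, input file, and rank count are placeholders.

# launch both participants as background job steps and wait for both to finish
time srun --mpi=pmi2 -n 12 python3 biot_solver.py biot.input > biotsolver-${SLURM_JOBID}.log 2>&1 &
time srun --mpi=pmi2 -n 12 python3 hd_flow_solver.py hdflow.input > hdflowsolver-${SLURM_JOBID}.log 2>&1 &
wait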

The hanging of the simulation at the start seems to be preCICE distributing the meshes and computing the meshes (?!). I will monitor this further and report again.

Hi @ajaust,

Your job script indeed looks desperate :see_no_evil:
What is your preCICE config? Do you use MPI or TCP/IP for m2n communication? If TCP/IP which network do you use?
What is missing in your job script is pinning your MPI jobs to nodes, like done here (see the sketch below).
In the past we also had problems with participants started on the same node.
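
Just to illustrate the idea (a rough sketch, not the script from the linked page): node names, rank counts, and the second script name below are placeholders, and the listed nodes must be part of the job's allocation (in practice you would derive them from $SLURM_JOB_NODELIST).

# give each participant its own, disjoint set of nodes via an explicit node list
srun --mpi=pmi2 -n 12 -w node01,node02 python3 biot_solver.py biot.input > biotsolver-${SLURM_JOBID}.log 2>&1 &
srun --mpi=pmi2 -n 12 -w node03,node04 python3 hd_flow_solver.py hdflow.input > hdflowsolver-${SLURM_JOBID}.log 2>&1 &
wait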

Thanks! At the moment I can say that starting the job from my job script with time srun --mpi=pmi2 -n 12 made the situation a bit more stable, but the simulation still tends to hang at the beginning, where both solvers just show [impl::SolverInterfaceImpl]:232 in initialize: Setting up slaves communication to coupling partner/s. It seems to be a bit random though, so I cannot reproduce it properly at the moment. :frowning:

I have attached my precice-config.xml (5.1 KB) for completeness!

I use TCP/IP (sockets) for the communication. I have tried to set the communication interface explicitly to eth0, but I am not sure if this changes much. I did it just now and restarted a setup that would otherwise hang during initialization, and it runs now. At the moment, I am not sure whether this is actually related or not. A lot of things seem to be unpredictable at the moment.
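
To see which interface names are actually available on the compute nodes, something like the following can be run inside a job (a minimal sketch, assuming iproute2 is installed on the nodes):

# list the network interfaces of one allocated compute node (look for eth0, ib0, ...)
srun -N 1 -n 1 ip -brief addr show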

Would you recommend trying the MPI communication instead? My libraries and solver are built on top of MPICH, so the support should be better than for OpenMPI.

I have tried to look for information on running preCICE on clusters, which led me mainly to the information about the MAC cluster, SuperMUC, and CooLMUC. The information about CooLMUC mentioned Slurm, so I assumed it would be easy to run jobs with Slurm.

I did not search for the term “Slurm”, which would have led me to the page that you have linked. However, even then I am not sure if I would have followed the job script on that page, as it is not obvious whether it is so elaborate due to Slurm or due to the configuration of that supercomputer. Maybe we should add some of this information to the network troubleshooting page, or add a new page to the wiki that collects links to and/or information about running preCICE on clusters.

Is it a problem of Slurm or of preCICE when participants share nodes? I tried to at least get my processes grouped by Slurm with -m block so that there would be minimal sharing, but it seems that Slurm does not care about that.

In our experience, sockets are more reliable, but MPI is definitely worth a try.

Is there no infiniband? ib0 or similar? Normally this is crucial, with lo you can only connect to sockets on the same node.

Sounds a lot like it depends on which hardware you get and how the threads are pinned.

Yes, good idea.

No, more a problem of the concrete supercomputer. It was a problem on Hazel Hen, IIRC.

Then I will stick with sockets for the moment.

No InfiniBand, but it is a bit faster than a normal standard network. It is an internal cluster of the institute. I want to get some experience with my code + preCICE and MPI before I go to the big machine.
I will see how it goes when specifying the network interface explicitly. The cluster here is also undergoing heavy maintenance at the moment.

From the preCICE documentation

For certain systems, you need to specify the network over which the TCP/IP sockets get connected: network="...". It defaults to "lo".

it was not really clear to me that I must specify the network interface in the preCICE configuration.

I guess it might be (the lack of) pinning or a configuration issue. It is a small system with only a few (16) identical nodes.

At least this problem is solved with Hazel Hen being disassembled. :wink: