Sharing nodes between participants in Slurm

I run a coupled simulation with two participants in a parallel-explicit scheme on an HPC cluster at LRZ using Slurm. I use the method suggested in the documentation (SLURM sessions | preCICE - The Coupling Library) to partition the allocated nodes between the two solvers. I'm using OpenMPI and launch the solvers with

(
mpirun -n <ntasks1> --hostfile <hostfile1> solver1 &
mpirun -n <ntasks2> --hostfile <hostfile2> solver2 &
wait
)

For load balancing in parallel coupling, I would like to distribute tasks so that both solvers take roughly the same time for one coupling window. However, if I set ntasks1 and ntasks2 such that the two solvers both have some of their tasks on the same node (e.g. allocate 3 nodes of 80 tasks each, give 100 tasks to solver 1 and 140 tasks to solver 2), only one solver runs. The other is blocked completely and doesn't even start. I can see that the host files contain the expected entries.
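For reference, with OpenMPI the two hostfiles in the shared-node case would look roughly like this (node names are hypothetical); the shared node02 appears in both files with partial slot counts:

# hostfile1 (solver 1, 100 tasks)
node01 slots=80
node02 slots=20

# hostfile2 (solver 2, 140 tasks)
node02 slots=60
node03 slots=80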

When the solvers cleanly divide the nodes (e.g. 80 for solver1, 160 for solver2), they both start and run in parallel as expected.

Is this a known limitation of Slurm, or could it be solved through Slurm configuration? I haven't found any information on this, probably because it is a niche use case.

Hi @dabele,

A big problem is that many factors play into this, primarily the SLURM configuration and the MPI implementation. Sadly, there is no standard for hostfiles, and MPI implementations vary in how expressive they can be.

Ideally, one could specify in the hostfile to use only the first n or last n slots on a given host. We haven't found a solution apart from not sharing nodes, and thus undersubscribing.

Job farming isn't a solution either, as, to my understanding, SLURM on clusters generally allocates full nodes to each job.

This means that your system admin is your best chance for helping you solve this issue.

If you have full control over the code of your solvers, then you could use the multiple-program multiple-data (MPMD) paradigm of OpenMPI to start both solvers in a single MPI_COMM_WORLD.
Then partition the communicator and forward each sub-communicator to the individual solver and preCICE.

mpirun -n 12 ./solverA : -n 24 ./solverB
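To make this concrete, here is a minimal C++ sketch of what each solver's startup could look like under MPMD. The hard-coded color is an assumption for illustration (0 for solverA, 1 for solverB); it is not part of any adapter.

#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Under MPMD both executables share one MPI_COMM_WORLD.
    // Each solver uses its own, distinct color (assumption: solverB uses 1).
    const int color = 0;
    int worldRank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);

    MPI_Comm solverComm = MPI_COMM_NULL;
    MPI_Comm_split(MPI_COMM_WORLD, color, worldRank, &solverComm);

    // From here on, the solver and preCICE should only use solverComm,
    // never MPI_COMM_WORLD.

    // ... solver setup, preCICE Participant construction, time loop ...

    MPI_Comm_free(&solverComm);
    MPI_Finalize();
    return 0;
}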

Best
Frédéric

That would work for us, I think. Intel MPI supports something similar, but I haven't tried it.

Are there other preCICE adapters that handle this case? Is there a convention for how to assign the “color” (in MPI_Comm_split) to participants? It could just be a new entry in the solver config, but maybe there's a smarter way.

There are many ways to split. If you use different applications, then you can use the program name in argv[0] with an if to assign numbers, compute a hash, or use a digit sum.

Then pass this value as the color to MPI_Comm_split.

If you use the same application, then you can pass some identifier via the command line.
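Following up on the sketch above, a hedged sketch of both options for determining the color; the --color flag and the argument parsing are assumptions, not an existing convention:

#include <functional>
#include <string>

// Different applications: derive a color from the program name in argv[0].
// MPI_Comm_split only needs a non-negative int that is identical within a group;
// a hash collision between the two program names is unlikely but possible.
int colorFromProgramName(const char* argv0) {
    const auto h = std::hash<std::string>{}(std::string(argv0));
    return static_cast<int>(h & 0x7fffffff);
}

// Same application: pass an identifier explicitly, e.g. "./solver --color 1"
// (hypothetical flag; adapt to your own argument handling).
int colorFromCommandLine(int argc, char** argv) {
    for (int i = 1; i + 1 < argc; ++i) {
        if (std::string(argv[i]) == "--color") {
            return std::stoi(argv[i + 1]);
        }
    }
    return 0;
}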

//Update: Resolved by using a different Participant constructor to pass the actual MPI Communicator instead of only rank and size.

//Original Post:
I tried to implement splitting the communicator (in both solvers). One of the solvers gets stuck in the constructor of precice::Participant.

mpirun is not supported by the MPI on the cluster, so I’m using srun. I have a multi-program configuration like this:

0-63 solver1
64-127 solver2

I allocate one node, 128 tasks. I start both solvers with srun --multi-prog mp.conf.

In each solver, I call MPI_Comm_split; each solver uses a different color argument.

Then I initialize preCICE, passing the size and rank of the split communicator (so ranks 0-63, size 64).

The first solver makes the connections and waits at “Setting up primary communication to coupling partner/s”. The second solver gets stuck. The call stack:

...
#10 precice::utils::IntraComm::barrier() ()
#11 precice::impl::ParticipantImpl::initializeIntraCommunication() ()
#12 precice::impl::ParticipantImpl::ParticipantImpl(std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, int, int, std::optional<void*>) ()
#13 precice::Participant::Participant(precice::span<char const, 18446744073709551615ul>, precice::span<char const, 18446744073709551615ul>, int, int) ()
...

I tried both m2n:sockets and m2n:mpi.

Also, for some reason, there is no precice-run directory in the working directory; I'm not sure whether that's related.

Are you also passing the pointer to the communicator as the 5th argument?
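For illustration, a minimal sketch of that call, assuming preCICE v3 and the sub-communicator from MPI_Comm_split; the participant and config file names are placeholders:

#include <mpi.h>
#include <precice/precice.hpp>

void setupPrecice(MPI_Comm solverComm) {
    int rank = 0, size = 0;
    MPI_Comm_rank(solverComm, &rank);
    MPI_Comm_size(solverComm, &size);

    // The 5th constructor argument is a pointer to the MPI communicator, so
    // preCICE sets up its intra-participant communication on solverComm
    // instead of MPI_COMM_WORLD.
    precice::Participant participant("Solver1",            // placeholder participant name
                                     "precice-config.xml", // placeholder config file
                                     rank, size, &solverComm);

    // ... define meshes, initialize, advance, finalize ...
}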

One example from our tests:
