Running preCICE on a Cluster

Hello,

I am running FSI simulations on a cluster. I use SLURM to manage parallel jobs on the nodes. I noticed that I can’t run the coupled simulation if I select more than one node, so the two participants (SU2 and MBDyn) have to share a single node.
Is there a way to use more than one node?

Thank you,

Alice

Hi,

Yes, it is possible to use more than one node. Indeed, preCICE has been used in very large simulations on some of the largest supercomputers you can find. You probably need to choose the correct network adapter. This is described in more detail here: Help! The participants are not finding each other! - #2 by Makis

Point 4 focuses on the choice of the network adapter. Does this solve your problem?


No, I tried with other networks but they do not work…

Weird. I have some further questions for debugging:

  • Which networks do you have available and in what sense does it not work? Do you get an error message, or do the solvers stop at the beginning of the simulation with a message saying that they wait for the other participant to connect? Could you upload your preCICE configuration and the output of the solvers when they cannot connect with each other?
  • Did you also check the other points in the mentioned post by Makis regarding relative folders, file permissions, and dirty files?

Also be sure to pin the MPI processes to the hardware. Often, you need separate nodes for different participants. Compare this job script:
https://precice.org/installation-special-systems.html#job-script-for-ateles

The default is lo, and then I have eth0, eth1, eth2 and eth3. When I use the same node for both participants, the default works. Within this node, I use --ntasks=6 for SU2, whereas MBDyn runs with only 1 task.
Here are my precice-config and the job scripts. I have two job scripts because I usually run each participant from its own folder.
Do you think I should try to use just one job script which has both commands?
run_MBDyn.txt (428 Bytes)
precice-config.xml (5.4 KB)
run_SU2.txt (549 Bytes)

The strange thing is that this job script says that the two participants cannot share the node, whereas in my case they have to…

Yes, I think you should put both commands in one job script. If you use more than one, there is the risk that your jobs do not even start at the same time. You will also need to set the network device to one of the other eth-X values. Which one, I cannot tell. You have to try, or ask your system administrator which device is used for network communication.
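
A minimal sketch of how a single job script could start both participants at the same time and wait for both of them. The function name run_coupled, the folder names and the placeholder commands are hypothetical; replace them with your actual SU2 and MBDyn launch lines.

```shell
#!/bin/sh
# Hypothetical helper: start two participant commands concurrently
# and return non-zero if either of them fails.
run_coupled() {
    sh -c "$1" &
    pid_a=$!
    sh -c "$2" &
    pid_b=$!
    status=0
    wait "$pid_a" || status=1
    wait "$pid_b" || status=1
    return "$status"
}

# Example call inside a job script (placeholder commands, not real launch lines):
#   run_coupled "cd fluid && mpirun -np 6 <SU2 command>" "cd solid && <MBDyn command>"
```

Starting both commands in the background and waiting on each process ID keeps the job alive until both solvers have finished, and the job reports a failure if either of them exits with an error.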

I think the script that @uekerman linked refers to some special cluster configurations. There have been cases where jobs cannot share nodes in the sense that:

  • If Solver A runs on nodes 1,2,3…k and Solver B runs on nodes k+1,…,n, this configuration works.
  • If Solver A runs on nodes 1,2,3…k-1 and has some ranks on k while Solver B also has some ranks on k, it fails. I think this is more a matter of system configuration.

I am still wondering in what sense running on two nodes does not work. Is there an error message or does the simulation simply not start?

Exactly, it doesn’t start. Both participants hang on
(0) 18:57:09 [impl::SolverInterfaceImpl]:253 in initialize: Setting up master communication to coupling partner/s

It seems as if they do not find each other.

Then I would suggest the following:

  • Make sure that both solvers look for precice-run in the same directory.
  • Make sure that precice-run is on a shared network drive such that all nodes can access it.
  • Start the simulation on two nodes and test the different network devices eth0 to eth3. Let each simulation run for 20-30 minutes even if it looks as if it does not start, just to rule out a problem with a very slow network or network drive. The simulation should normally start much faster, but I have observed such odd behavior in one case.
    • Alternatively, you can ask your system’s administrator about the right network device if this is easier than trying.
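
To narrow down the candidates before trying them one by one, it can help to list which interfaces carry an IPv4 address on a compute node. A small sketch, assuming the ip tool from iproute2 is installed on the nodes; the function name parse_ip_addr is made up for this example. The interface whose address lies in the cluster's internal network is a likely candidate for the network attribute of the m2n:sockets tag in precice-config.xml.

```shell
#!/bin/sh
# Reduce the one-line-per-address output of `ip -o -4 addr show`
# to "<interface> <IPv4 address/prefix>" pairs.
parse_ip_addr() {
    awk '{ print $2, $4 }'
}

# On a compute node:
#   ip -o -4 addr show | parse_ip_addr
```

Running this on two different nodes of the same job shows which interface names exist on both and which address ranges they use.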

Thank you, I’ll try as you suggest and I’ll update you!

I am curious what comes out of this!

If all of this fails, one should also look a bit deeper into @uekerman’s suggestion. For that, one would need to write a somewhat fancier SLURM script, similar to the examples that were linked.


We ran into this issue on various machines.

slurm uses the environment to tell MPI which nodes it can use etc.
So running 2 executables using MPI simultaneously in a slurm job will populate the same nodes.

Here is an artificial example with 1 slot per node running on nodes n0-n5:

mpirun -np 2 A &
mpirun -np 4 B
Nodes    n0 n1 n2 n3 n4 n5
A ranks  0  1
B ranks  0  1  2  3

Here, the nodes n4-n5 won’t do anything.

What we found is that some versions of some MPI implementations will tolerate this, some will crash on startup and some will hang on communication build-up.

One solution is to partition the allocated nodes using hostfiles. Sadly each MPI implementation has adopted its own format.
We did not manage to partition the slots within individual nodes, so we have to run using complete nodes.

Example with hostfiles

# partition the session as shown in our script
mpirun -np 2 --hostfile A.hostfile A &
mpirun -np 4 --hostfile B.hostfile B
Nodes    n0 n1 n2 n3 n4 n5
A ranks  0  1
B ranks        0  1  2  3
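
A rough sketch of what such a partitioning step could look like. This is not the script mentioned above. It assumes a hostfile format with one host name per line and complete nodes per participant; check the documentation of your MPI implementation for the format it expects. The function name split_nodes is hypothetical.

```shell
#!/bin/sh
# Split a list of node names (one per line on stdin) into two hostfiles.
# $1: number of nodes for participant A; all remaining nodes go to B.
# $2, $3: names of the hostfiles to write.
split_nodes() {
    n_a=$1
    file_a=$2
    file_b=$3
    nodes=$(cat)
    printf '%s\n' "$nodes" | head -n "$n_a" > "$file_a"
    printf '%s\n' "$nodes" | tail -n +"$((n_a + 1))" > "$file_b"
}

# Inside a SLURM job, the node list could come from scontrol:
#   scontrol show hostnames "$SLURM_JOB_NODELIST" | split_nodes 2 A.hostfile B.hostfile
```

With the six nodes n0-n5 from the example, A.hostfile would contain n0 and n1, and B.hostfile would contain n2 to n5.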

I have been tempted to turn this slurm partitioning into a stand-alone script.
Would there be interest for such a tool?

Yes, why not.
I just want to add to the discussion: check whether your MPI implementation supports omplace and whether it is installed on your cluster. It allows you to map processes to ranks via the command line, which seems to be a very handy approach.

Highlight by me.
Just for my understanding: If I use preCICE on a machine with SLURM, I have to make sure that no node has ranks of two different participants. Does this mean it is a limitation of preCICE?

Is this documented anywhere? I have seen some odd behavior of some simulations on SLURM systems, but I was not able to debug them properly due to cluster configuration and lack of proper reporting.

I also remember seeing a run script for Hazel Hen or so (at least a SLURM system) where the hostfile was created in the SLURM script. I think it was part of the documentation. Does this still exist? I quickly looked in the docs, but could not find it.

This is really interesting. I’m not sure how to find my hostfile format, but I’ll look into it.
My problem is that if I use 1 node for both participants it works (SU2 runs in parallel and MBDyn runs in serial). I’ve also tried submitting only one job, but the problem with more than 1 node remains. It’s strange because both participants are frozen, but this time the slaves are connected:
(0) 07:10:40 [impl::SolverInterfaceImpl]:285 in initialize: Slaves are connected

After this message, nothing happens.

That the solvers can connect to each other already sounds like good news to me.

Could you compile a Debug build of preCICE? With that you can configure the logging verbosity. It would be interesting to see the full output of both solvers when the simulation is run from only one job. Maybe this gives some insight into the point at which the simulation hangs.

Hello everyone,

In the end it was the network name. After a lot of trial and error, I found the correct IP network name. Now multiple nodes can communicate together!

Thank you for your help.

