I am running FSI simulations on a cluster and use SLURM to manage parallel jobs on the nodes. I noticed that I can’t run the coupled simulation if I select more than one node: the two participants (SU2 and MBDyn) have to share a single node.
Is there a way to use more than one?
Yes, it is possible to use more than one node. Indeed, preCICE has been used in very large simulations on some of the largest supercomputers you can find. You probably need to choose the correct network adapter. This is described in more detail here: Help! The participants are not finding each other! - #2 by Makis
Point 4 focuses on the choice of the network adapter. Does this solve your problem?
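If you are unsure which adapters the compute nodes actually offer, here is a minimal sketch of how one could check from inside an allocation (the exact SLURM options depend on your cluster; the interface names in the comment are only examples):

```bash
# List the network interfaces visible on a compute node.
# Run this from inside an allocation (salloc) or put it into a small sbatch script.
srun --nodes=1 --ntasks=1 ip -brief addr show

# Typical names are lo, eth0 ... eth3, or ib0 for InfiniBand. The name you pick
# goes into the network="..." attribute of the m2n:sockets tag in precice-config.xml.
```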
Weird. I have some further questions for debugging:
Which networks do you have available, and in what sense does it not work? Do you get an error message, or do the solvers stop at the beginning of the simulation with some message where they wait for the other participant to connect? Could you maybe upload your preCICE configuration and the output of the solvers when they cannot connect to each other?
Did you also check the other points in the mentioned post by Makis regarding relative folders, file permissions, and dirty files?
The default is lo, and then I have eth0-1-2-3. When I use the same node for both participants, the default works. Within this node, I use --ntasks=6 for SU2, whereas MBDyn runs with only 1 task.
Here are my precice-config and the job scripts. I have two job scripts because I usually run each participant from its own folder.
Do you think I should try to use just one job script that contains both commands? run_MBDyn.txt (428 Bytes) precice-config.xml (5.4 KB) run_SU2.txt (549 Bytes)
Yes, I think you should put both commands in one job script. If you use more than one, there is the risk that your jobs do not even start at the same time. You will also need to set the network device to one of the other eth-X values. Which one, I cannot tell; you have to try them, or ask your system administrator which device is used for communication between the nodes.
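For illustration, a combined job script could look roughly like this. This is only a sketch: the folder names, task counts, solver executables, and input files are placeholders I assumed, not taken from your attached scripts.

```bash
#!/bin/bash
#SBATCH --job-name=fsi-coupled
#SBATCH --nodes=2
#SBATCH --ntasks=7            # e.g. 6 for SU2 + 1 for MBDyn

# Start both participants inside the same job so they are guaranteed to run
# at the same time; each one is launched from its own folder.
( cd SU2-folder   && mpirun -np 6 SU2_CFD su2_config.cfg > su2.log 2>&1 ) &
( cd MBDyn-folder && mbdyn -f model.mbd > mbdyn.log 2>&1 ) &

wait   # keep the job alive until both participants have finished
```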
I think the script that @uekerman has linked refers to some special cluster configurations. There have been cases where jobs cannot share nodes, in the sense that:
- If Solver A runs on nodes 1,2,3,…,k and Solver B runs on nodes k+1,…,n, this configuration works.
- If Solver A runs on nodes 1,2,3,…,k-1 and has some ranks on k, while Solver B also has some ranks on k, it fails. I think this is more a matter of the system configuration.
I am still wondering in what sense running on two nodes does not work. Is there an error message or does the simulation simply not start?
Exactly, it doesn’t start. Both participants hang at `(0) 18:57:09 [impl::SolverInterfaceImpl]:253 in initialize: Setting up master communication to coupling partner/s`
1. Make sure that both solvers look for precice-run in the same directory.
2. Make sure that precice-run is on a shared network drive such that all nodes can access it (a quick check is sketched below).
3. Start the simulation on two nodes and test out the different network devices eth0 to eth3. Make sure that you let each simulation run for 20-30 minutes, even if it looks like it does not start, just to rule out a problem with a very slow network or network drive. The simulation should normally start much faster, but I have observed such odd behavior in one case.
4. Alternatively, you can ask your system administrator about the right network device, if this is easier than trying.
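A quick way to verify the first two points is to let every allocated node report what it sees of the working directory. This is a sketch assuming you launch it from the directory where the solvers run:

```bash
# One task per allocated node: every node should report the same (shared)
# working directory and find the precice-run folder inside it.
srun --ntasks-per-node=1 bash -c 'hostname; ls -ld "$PWD" "$PWD/precice-run"'
```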
If this all fails, one should also look a bit deeper into @uekerman’s suggestion. For that, one would need to write a somewhat fancier SLURM script, similar to the examples that were linked.
SLURM uses the environment to tell MPI which nodes it can use, etc.
So running two MPI executables simultaneously in a SLURM job will populate the same nodes.
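For instance, these are the kinds of variables the launcher picks up from the job environment (a sketch; the exact set of variables can differ between SLURM versions and MPI implementations):

```bash
# Inside a SLURM job script: the MPI launcher derives its node and slot
# list from variables like these.
echo "Node list:      $SLURM_JOB_NODELIST"
echo "Total tasks:    $SLURM_NTASKS"
echo "Tasks per node: $SLURM_TASKS_PER_NODE"
```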
Here is an artificial example with 1 slot per node running on nodes n0-n5:
```
mpirun -np 2 A &
mpirun -np 4 B
```
| Nodes   | n0 | n1 | n2 | n3 | n4 | n5 |
| ------- | -- | -- | -- | -- | -- | -- |
| A ranks | 0  | 1  |    |    |    |    |
| B ranks | 0  | 1  | 2  | 3  |    |    |
Here, the nodes n4-n5 won’t do anything.
What we found is that some versions of some MPI implementations will tolerate this, some will crash on startup and some will hang on communication build-up.
One solution is to partition the allocated nodes using hostfiles. Sadly each MPI implementation has adopted its own format.
We did not manage to partition the slots within individual nodes, so we have to run using complete nodes.
Example with hostfiles
```
# partition the session as shown in our script
mpirun -np 2 --hostfile A.hostfile A &
mpirun -np 4 --hostfile B.hostfile B
```
| Nodes   | n0 | n1 | n2 | n3 | n4 | n5 |
| ------- | -- | -- | -- | -- | -- | -- |
| A ranks | 0  | 1  |    |    |    |    |
| B ranks |    |    | 0  | 1  | 2  | 3  |
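A minimal sketch of how such hostfiles could be generated inside the SLURM job script. This assumes Open MPI's `hostname slots=N` hostfile format; MPICH-based implementations use a different syntax, so adapt accordingly:

```bash
# Expand the compressed node list of this job into one hostname per line.
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))

# Give the first 2 nodes to solver A and the remaining ones to solver B
# (whole nodes only, matching the table above).
printf '%s slots=1\n' "${nodes[@]:0:2}" > A.hostfile
printf '%s slots=1\n' "${nodes[@]:2}"   > B.hostfile

mpirun -np 2 --hostfile A.hostfile A &
mpirun -np 4 --hostfile B.hostfile B
wait
```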
I have been tempted to turn this SLURM partitioning into a stand-alone script.
Would there be interest for such a tool?
Yes, why not.
I just want to add to the discussion: check whether your MPI implementation supports omplace and whether it is installed on your cluster. It allows you to map processes to ranks via the command line, which seems to be a very handy approach.
Just for my understanding: if I use preCICE on a machine with SLURM, I have to make sure that no node has ranks of two different participants. Does this mean it is a limitation of preCICE?
Is this documented anywhere? I have seen some odd behavior of some simulations on SLURM systems, but I was not able to debug them properly due to cluster configuration and lack of proper reporting.
I also remember seeing a run script for Hazel Hen or so (at least a SLURM system) where the hostfile was created in the SLURM script. I think it was part of the documentation. Does this still exist? I quickly looked in the docs, but could not find it.
This is really interesting. I’m not sure how to find my hostfile format, but I’ll look into it.
My problem is that if I use 1 node for both participants, it works (SU2 runs in parallel and MBDyn runs in serial). I’ve also tried submitting only one job, but the problem with more than 1 node remains. It’s strange, because both participants are frozen, but this time the slaves are connected: `(0) 07:10:40 [impl::SolverInterfaceImpl]:285 in initialize: Slaves are connected`
That the solvers can connect to each other already sounds like good news to me.
Could you compile a Debug build of preCICE? With that, you can configure the logging verbosity. It would be interesting to see the full output of both solvers when the simulation runs as a single job. Maybe this gives some insight into the point at which the simulation is hanging.
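In case it helps, here is a rough sketch of building a Debug version from source with CMake (the install prefix and job count are placeholders; see the preCICE build documentation for the full set of options and dependencies):

```bash
# Build and install a Debug version of preCICE (enables debug/trace logging).
git clone https://github.com/precice/precice.git
cd precice && mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=$HOME/software/precice-debug ..
make -j 4
make install
```

As far as I remember, debug-level messages are only available in Debug builds; the verbosity itself is then controlled through the logging configuration in precice-config.xml.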