The simulations are run on a two-node cluster (cosrnode01g and cosrnode01), and the network information of the nodes is attached. The execution scenario is as follows: HOS-Coupling runs on cosrnode01g, while Near1 runs on cosrnode01.
The issue I encountered is this: if Near1 is started first and HOS-Coupling is started afterwards, the coupled simulation runs normally. Otherwise, both sides hang at the stage:
preCICE: Setting up primary communication to coupling partner/s
May I ask how this behavior can be explained? In addition, could you please explain in more detail the roles of acceptor and connector, and their relationship to participants first and second?
The behavior makes sense once you know that preCICE’s acceptor/connector naming is a bit backwards from what you’d expect at the socket level:
connector (Near1) = the server, it opens a port and writes its address to a file in {exchange-directory}/precice-run/
acceptor (HOS-Coupling) = the client, it polls for that file, then connects
So if HOS-Coupling (the client) starts first, it polls forever for a file that Near1 (the server) hasn’t created yet, hence the hang at “Setting up primary communication”. There’s no timeout on the polling loop, so it just waits silently.
You can fix this by making sure that Near1 always starts first, or by swapping the roles:
Now HOS-Coupling is the server and Near1 is the client, so just reverse which one you start first.
As for <participants first="Near1" second="HOS-Coupling" />, that’s completely unrelated to acceptor/connector. It only controls iteration order within each time window for the explicit scheme.
If that is the case, and the user must determine the participant startup order based on the definitions of acceptor and connector, then what should be done on an HPC system using a PBS queueing system, where the user cannot control the exact startup order of the programs?
This is wrong. Versions 1 and 2 used confusing naming, which is why we renamed the options to clarify what they do.
The acceptor is the server accepting connections. It writes the endpoint information to a file and waits for connections.
The connector is the client establishing connections. It waits for the file to appear, reads the endpoint information, and then requests a connection.
The m2n tag defines how participants are connected (backend, network, etc.).
The coupling scheme defines which participants are coupled.
Both need to exist, but their order (first, second, acceptor, connector) is independent.
Primary communication only connects primary ranks, so rank 0 to rank 0.
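To make the acceptor and connector roles described above more concrete, here is a minimal Python sketch of a file-based rendezvous. It is only an illustration of the idea, not preCICE's implementation: the file name `endpoint`, the loopback address, the polling interval, and the timeout are all assumptions chosen for the demo. The point it illustrates is that, with this kind of handshake, the start order of the two sides should not matter.

```python
import os
import socket
import tempfile
import threading
import time


def acceptor(exchange_dir):
    """Server side: open a port, publish it in a file, wait for one connection."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    host, port = server.getsockname()
    # Write under a temporary name and rename, so the connector
    # never reads a half-written endpoint file.
    tmp_path = os.path.join(exchange_dir, "endpoint.tmp")
    final_path = os.path.join(exchange_dir, "endpoint")
    with open(tmp_path, "w") as f:
        f.write(f"{host}:{port}")
    os.replace(tmp_path, final_path)
    conn, _ = server.accept()
    chunks = []
    with conn:
        while True:
            data = conn.recv(1024)
            if not data:  # connector closed its side
                break
            chunks.append(data)
    server.close()
    return b"".join(chunks).decode()


def connector(exchange_dir, timeout=5.0, poll_interval=0.05):
    """Client side: wait for the endpoint file, read it, then connect."""
    path = os.path.join(exchange_dir, "endpoint")
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            raise TimeoutError(f"no endpoint file at {path}")
        time.sleep(poll_interval)
    with open(path) as f:
        host, port = f.read().rsplit(":", 1)
    with socket.create_connection((host, int(port)), timeout=timeout) as sock:
        sock.sendall(b"hello from connector")


def run_demo(connector_first):
    """Start both sides in the given order and return what the acceptor received."""
    received = []
    with tempfile.TemporaryDirectory() as exchange_dir:
        a = threading.Thread(target=lambda: received.append(acceptor(exchange_dir)))
        c = threading.Thread(target=connector, args=(exchange_dir,))
        first, second = (c, a) if connector_first else (a, c)
        first.start()
        time.sleep(0.2)  # make the start order unambiguous
        second.start()
        a.join()
        c.join()
    return received[0]
```

Calling `run_demo(True)` and `run_demo(False)` exercises both start orders. The timeout in `connector` is a choice made for this sketch so that the demo cannot hang; it says nothing about how preCICE itself behaves.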
The working case:
Near1 is the connector, so it actively waits for HOS-Coupling to create the endpoint file. HOS-Coupling then starts and creates the file. Near1 finds the file, reads the endpoint info, and connects to HOS-Coupling rank 0.
This is actually the error-prone case, as Near1 can read old, thus incorrect, endpoint information from a previous run. Near1 cannot detect if these files are incorrect.
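As a user-side guard against the stale-file problem described above, one option is to check whether a file predates the current run before trusting it. The following is a small sketch of that idea in Python. The function name, the `tolerance` parameter, and the file name used in the demo are hypothetical, and this is not a preCICE feature. It also assumes that the clocks of the nodes and of the shared filesystem agree reasonably well, which may not hold on every cluster.

```python
import os
import tempfile
import time


def endpoint_is_stale(path, run_started_at, tolerance=1.0):
    """Return True if the file at `path` was last modified before this run began.

    `run_started_at` is a timestamp in seconds since the epoch, for example
    recorded with time.time() at the top of the job. `tolerance` absorbs
    small clock differences between nodes.
    """
    return os.path.getmtime(path) < run_started_at - tolerance


def demo():
    """Create one old and one fresh file and classify both."""
    run_started_at = time.time()
    with tempfile.TemporaryDirectory() as d:
        old = os.path.join(d, "old-endpoint")
        new = os.path.join(d, "new-endpoint")
        for p in (old, new):
            with open(p, "w") as f:
                f.write("placeholder")
        # Pretend the first file was left over from a run one hour ago.
        an_hour_ago = run_started_at - 3600
        os.utime(old, (an_hour_ago, an_hour_ago))
        return endpoint_is_stale(old, run_started_at), endpoint_is_stale(new, run_started_at)
```

Deleting the whole `precice-run` directory before every run is the simpler approach and avoids the clock question altogether; a check like this is only useful when that is not possible.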
The failing case:
So the acceptor HOS-Coupling starts first, creates the endpoint file and waits.
Near1 then fails to find it, fails to read it or fails to connect.
This is very unusual and hard to figure out. Could you enable debug logs and try again?
We have no experience with PBS. Lessons from SLURM are to either request all resources you need in a single job script, or use additional metadata to tell the batch system that it needs to run both jobs at the same time.
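The "single job script" approach mentioned above can be sketched as a small Python launcher that starts both participants from inside one allocation and then waits for both. This is a sketch under assumptions, not a tested PBS recipe: the `mpirun` arguments, the process counts, and the executable paths in `EXAMPLE_COMMANDS` are placeholders to be replaced with whatever the site and the MPI installation require.

```python
import subprocess
import sys


def launch_together(commands):
    """Start every command without waiting, then wait for all of them.

    Returns the exit codes in the same order as `commands`.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


# Hypothetical commands for the two participants; the host names come from
# the thread, everything else is a placeholder and is not executed here.
EXAMPLE_COMMANDS = [
    ["mpirun", "-np", "4", "--host", "cosrnode01g", "./HOS-Coupling"],
    ["mpirun", "-np", "4", "--host", "cosrnode01", "./Near1"],
]
```

Inside the job script one would call `launch_together(EXAMPLE_COMMANDS)` after cleaning the exchange directory. Because both processes are started from the same job, neither depends on the queue scheduling a second job at the right moment.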
Thank you for the correction, I had the roles completely backwards and I apologize for the confusion this caused.
Just to confirm my understanding now: acceptor = server (writes endpoint file, waits for connections), connector = client (polls for the file, then connects). So HOS-Coupling is the server in Ya_Squall’s setup, which makes the failing case genuinely unexpected and likely environment-specific.
The working case:
Near1 is the connector, so it actively waits for HOS-Coupling to create the endpoint file. HOS-Coupling then starts and creates the file. Near1 finds the file, reads the endpoint info, and connects to HOS-Coupling rank 0.
This is actually the error-prone case, as Near1 can read old, thus incorrect, endpoint information from a previous run. Near1 cannot detect if these files are incorrect.
At the very beginning of the qsub script, I have:
echo "Remove existing precice-run..."
rm -rf ./precice-run
... ...
other commands that follow...
The failing case:
So the acceptor HOS-Coupling starts first, creates the endpoint file and waits.
Near1 then fails to find it, fails to read it or fails to connect.
This is very unusual and hard to figure out. Could you enable debug logs and try again?
So far I have been running a released version of preCICE v3.3.0 from within an Apptainer SIF file, so I will have to recompile to enable debug output.
We have no experience with PBS. Lessons from SLURM are to either request all resources you need in a single job script, or use additional metadata to tell the batch system that it needs to run both jobs at the same time.
Would it be possible to share your SLURM job script? I guess, though, that there is no guarantee that all the jobs actually start at exactly the same time, even if you request it.
Just for a bit more context, on the same machine, when I run mpirun -np xxx -hostfile nodeNames A-Normal-Foam-Solver -parallel, it works fine regardless of which node I start it from.