Question on preCICE multi-node sockets coupling: startup order dependency between acceptor and connector

Ya_Squall · March 31, 2026, 2:22am

A preCICE-based coupled solver is used, with two participants: HOS-Coupling and Near1. The following is an excerpt from precice-config.xml:

<m2n:sockets acceptor="HOS-Coupling" connector="Near1" exchange-directory=".." network="precice_ib0" enforce-gather-scatter="1" />

<coupling-scheme:parallel-explicit>
  <time-window-size value="0.01" />
  <max-time value="100.0" />
  <participants first="Near1" second="HOS-Coupling" />
  ... ... ...
</coupling-scheme:parallel-explicit>

The simulations are run on a two-node cluster (cosrnode01g and cosrnode01), and the network information of the nodes is attached. The execution scenario is as follows: HOS-Coupling runs on cosrnode01g, while Near1 runs on cosrnode01.

The issue I encountered is this: if Near1 is started first and HOS-Coupling is started afterwards, the coupled simulation runs normally. Otherwise, both sides hang at the stage:

preCICE:  Setting up primary communication to coupling partner/s

May I ask how this behavior can be explained? In addition, could you please explain in more detail the roles of acceptor and connector, and their relationship to participants first and second?

ifconfig-cosrnode01.md (2.6 KB)

ifconfig-cosrnode01g.md (3.0 KB)

precice-config.xml (5.0 KB)

ashraf_mohamed · March 31, 2026, 10:40am

The behavior makes sense once you know that preCICE’s acceptor/connector naming is a bit backwards from what you’d expect at the socket level:

connector (Near1) = the server, it opens a port and writes its address to a file in {exchange-directory}/precice-run/
acceptor (HOS-Coupling) = the client, it polls for that file, then connects

So if HOS-Coupling (the client) starts first, it polls forever for a file that Near1 (the server) hasn’t created yet, hence the hang at “Setting up primary communication”. There’s no timeout on the polling loop, so it just waits silently.

You can fix this by making sure that Near1 always starts first, or by swapping the roles:

<m2n:sockets acceptor="Near1" connector="HOS-Coupling" ... />

Now HOS-Coupling is the server and Near1 is the client, so you just reverse which one you start first.

As for <participants first="Near1" second="HOS-Coupling" />, that’s completely unrelated to acceptor/connector. It only controls iteration order within each time window for the explicit scheme.

Ya_Squall · March 31, 2026, 3:18pm

Many thanks for the reply.

If that is the case, and the user must determine the participant startup order based on the definitions of acceptor and connector, then what should be done on an HPC system using a PBS queueing system, where the user cannot control the exact startup order of the programs?

fsimonis · March 31, 2026, 3:58pm

This is wrong. Versions 1 and 2 used confusing naming, which is why we renamed the options to clarify what they do.

The acceptor is the server accepting connections. It writes the endpoint information to a file and waits for connections.
The connector is the client establishing connections. It waits for the file to appear, reads the endpoint information, and then requests a connection.
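The file-based handshake described above can be sketched in a few lines of Python. This is only an illustration, not preCICE's implementation: the function names, the single `endpoint` file, the `host:port` format, and the polling timeout are all assumptions made for the example. It shows that, with this scheme, the result should not depend on which side starts first.

```python
import os
import socket
import tempfile
import threading
import time


def acceptor(endpoint_file, result, timeout=10.0):
    """Server side: open a port, publish it in a file, wait for one connection."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        server.settimeout(timeout)  # give up instead of waiting forever
        host, port = server.getsockname()
        # Write under a temporary name, then rename, so the connector
        # never sees a half-written endpoint file.
        tmp_file = endpoint_file + ".tmp"
        with open(tmp_file, "w") as f:
            f.write(f"{host}:{port}")
        os.replace(tmp_file, endpoint_file)
        conn, _ = server.accept()
        with conn:
            result["received"] = conn.recv(1024).decode()


def connector(endpoint_file, message, timeout=10.0, poll_interval=0.05):
    """Client side: wait for the endpoint file, read it, then connect."""
    deadline = time.monotonic() + timeout
    while not os.path.exists(endpoint_file):
        if time.monotonic() > deadline:
            raise TimeoutError(f"no endpoint file at {endpoint_file}")
        time.sleep(poll_interval)
    with open(endpoint_file) as f:
        host, port = f.read().rsplit(":", 1)
    with socket.create_connection((host, int(port)), timeout=timeout) as client:
        client.sendall(message.encode())


def run_pair(acceptor_first, delay=0.2):
    """Start both sides in either order and return what the acceptor received."""
    with tempfile.TemporaryDirectory() as exchange_dir:
        endpoint_file = os.path.join(exchange_dir, "endpoint")
        result = {}
        a = threading.Thread(target=acceptor, args=(endpoint_file, result))
        c = threading.Thread(target=connector, args=(endpoint_file, "hello"))
        first, second = (a, c) if acceptor_first else (c, a)
        first.start()
        time.sleep(delay)  # emulate a delayed start of the second side
        second.start()
        a.join()
        c.join()
        return result.get("received")
```

Calling `run_pair(True)` and `run_pair(False)` exercises both startup orders. In this toy version the only thing the connector needs is that the file it reads belongs to the current run.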

The m2n tag defines how participants are connected (backend, network, etc.).
The coupling scheme defines which participants are coupled.
Both need to exist, but their order (first, second, acceptor, connector) is independent.

Ya_Squall:

if Near1 is started first and HOS-Coupling is started afterwards, the coupled simulation runs normally. Otherwise, both sides hang at the stage:
preCICE:  Setting up primary communication to coupling partner/s
May I ask how this behavior can be explained?

Primary communication only connects primary ranks, so rank 0 to rank 0.

The working case:
Near1 is the connector, so it actively waits for HOS-Coupling to create the endpoint file. HOS-Coupling then starts and creates the file. Near1 finds the file, reads the endpoint info, and connects to HOS-Coupling rank 0.

This is actually the error-prone case, as Near1 can read old, thus incorrect, endpoint information from a previous run. Near1 cannot detect if these files are incorrect.
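One way to guard against stale endpoint files is to remove the run directory once, before any participant starts. A minimal sketch, assuming a launcher script that runs before both participants and that the directory is named `precice-run` inside the exchange directory (as in the paths mentioned in this thread); the helper name is hypothetical:

```python
import os
import shutil


def clean_exchange_directory(exchange_dir):
    """Remove leftover endpoint files from a previous run.

    Assumes the run directory is `<exchange_dir>/precice-run`.
    Returns the path that was cleaned.
    """
    run_dir = os.path.join(exchange_dir, "precice-run")
    # ignore_errors=True: a missing directory is not an error here
    shutil.rmtree(run_dir, ignore_errors=True)
    return run_dir
```

The cleanup has to happen exactly once and before the first participant starts. If it runs after the acceptor has already published its endpoint, it deletes the fresh file as well, and the connector then waits for a file that will never reappear.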

The failing case:

So the acceptor HOS-Coupling starts first, creates the endpoint file and waits.
Near1 then fails to find it, fails to read it or fails to connect.
This is very unusual and hard to figure out. Could you enable debug logs and try again?

We have no experience with PBS. Lessons from SLURM are to either request all resources you need in a single job script, or use additional metadata to tell the batch system that it needs to run both jobs at the same time.
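The "single job" idea can be sketched as a small launcher that starts all participants from within one allocation, so the batch system cannot schedule them at different times. This is a sketch under assumptions, not a tested PBS or SLURM recipe: the function name is hypothetical, and the real participant commands (for example `mpirun` with a host file per participant) have to be supplied by the user.

```python
import subprocess


def launch_participants(commands):
    """Start every participant without waiting, then collect exit codes.

    `commands` maps a participant name to its argument list.
    Returns a dict mapping each name to its exit code.
    """
    # Popen returns immediately, so all participants run concurrently.
    procs = {name: subprocess.Popen(cmd) for name, cmd in commands.items()}
    # Only wait once everything has been started.
    return {name: proc.wait() for name, proc in procs.items()}


# Hypothetical usage inside a job script; the commands are placeholders:
# codes = launch_participants({
#     "HOS-Coupling": ["mpirun", "-np", "4", "./hos-coupling"],
#     "Near1": ["mpirun", "-np", "8", "./near1"],
# })
```

Because both processes are started back to back inside the same job, the remaining delay between them is small compared with two independently queued jobs.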

ashraf_mohamed · March 31, 2026, 4:23pm

Thank you for the correction, I had the roles completely backwards and I apologize for the confusion this caused.

Just to confirm my understanding now: acceptor = server (writes endpoint file, waits for connections), connector = client (polls for the file, then connects). So HOS-Coupling is the server in Ya_Squall’s setup, which makes the failing case genuinely unexpected and likely environment-specific.

Ya_Squall · April 1, 2026, 2:31am

fsimonis:

The working case:
Near1 is the connector, so it actively waits for HOS-Coupling to create the endpoint file. HOS-Coupling then starts and creates the file. Near1 finds the file, reads the endpoint info, and connects to HOS-Coupling rank 0.

This is actually the error-prone case, as Near1 can read old, thus incorrect, endpoint information from a previous run. Near1 cannot detect if these files are incorrect.

At the very beginning of the qsub script, I have:

echo "Remove existing precice-run..."
rm -rf ./precice-run

# ... other commands follow ...

fsimonis:

The failing case:

So the acceptor HOS-Coupling starts first, creates the endpoint file and waits.
Near1 then fails to find it, fails to read it or fails to connect.
This is very unusual and hard to figure out. Could you enable debug logs and try again?

So far I have been running the released preCICE v3.3.0 from within an Apptainer SIF file, so I will have to recompile to enable debug output.

fsimonis:

We have no experience with PBS. Lessons from SLURM are to either request all resources you need in a single job script, or use additional metadata to tell the batch system that it needs to run both jobs at the same time.

Would it be possible to share your SLURM job script? I guess, though, that there is no guarantee that all jobs actually start at exactly the same time, even if you request it.

fsimonis · April 1, 2026, 9:13am

We have a documentation page regarding this

Ya_Squall · April 1, 2026, 10:47am

Here are the debug log files.

Near1.log (122.0 KB)

HOS-coupling.log (56.5 KB)

precice-Near1.log (20.8 KB)

precice-HOS-coupling.log (20.3 KB)

Ya_Squall · April 6, 2026, 3:01am

Just for a bit more context, on the same machine, when I run mpirun -np xxx -hostfile nodeNames A-Normal-Foam-Solver -parallel, it works fine regardless of which node I start it from.

fsimonis · April 10, 2026, 2:08pm

This is very strange. I highly suspect that this has to do with the cluster setup.
Please contact your cluster admins and explain the issue to them.

A long-term solution for such problems could be to exchange the endpoints via a server, but this will still take some time to develop.

github.com/precice/precice

Server-based Exchange of Communication Endpoints

opened 10:29AM - 06 Nov 19 UTC

fsimonis

enhancement thesis configuration

# Problem

preCICE requires a shared filesystem for all participants. This limitation is rooted in the fact that we need to exchange endpoints between multiple participants and potentially their individual MPI ranks. This happens during the initialization of the SolverInterface in the [m2n Package](https://xgm.de/precice/docs/develop/structprecice_1_1m2n_1_1BoundM2N.html). Solvers, however, may run in environments that do not fulfill this requirement. Possible scenarios are isolated containers or VMs that only allow TCP interconnections.

# Proposed Solution

We could introduce an alternative to the file-based exchange via an additional server. All participants connect to the server via TCP, which allows them to register and query individual endpoints. The primary goal is not to improve the runtime of the initialization phase, but to remove the constraint of a shared filesystem.

A possible implementation could be a custom RESTful server based on python-flask using the following routes:

* `/<Requester>/<Acceptor>/master` for exchanging master endpoints
* `/<Requester>/<Acceptor>/slave/<Rank>` for exchanging slave endpoints

`POST`, `GET` and `DELETE` would correspond to registering, querying and deleting an endpoint.

An alternative to a custom server could be a [redis server](https://redis.io/). This is a widespread in-memory key-value store and available on most hosting platforms. There are many [C++ libraries](https://redis.io/clients#c--), some of which are based on one of our dependencies, `boost.asio`.

# Current implementation

https://github.com/precice/precice/blob/develop/src/com/ConnectionInfoPublisher.cpp

# Work packages

1. Prototype: replace ConnectionInfoPublisher by http calls and use a key-value database server to get this to work
2. Add configuration option to switch between file-based exchange with `exchange-directory` and server-based exchange with `exchange-server`
4. Try various common solutions like redis or valkey and potentially a minimal custom implementation
5. Perform scalability tests and compare to [previous attempts with an MPI server](https://github.com/precice/precice/issues/549#issuecomment-550266743)
6. Document
7. Bonus: Compare against a gather scatter method
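To make the proposed register/query/delete semantics concrete, here is a minimal in-memory sketch. It is not the planned preCICE feature and involves no HTTP server or redis: the class and method names are hypothetical, and only the key layout mirrors the routes proposed in the issue.

```python
class EndpointRegistry:
    """In-memory stand-in for the proposed endpoint server (illustration only)."""

    def __init__(self):
        self._endpoints = {}

    @staticmethod
    def key(requester, acceptor, rank=None):
        """Build a key that mirrors the routes proposed in the issue."""
        base = f"/{requester}/{acceptor}"
        return f"{base}/master" if rank is None else f"{base}/slave/{rank}"

    def register(self, key, endpoint):
        """Would correspond to POST: publish an endpoint."""
        self._endpoints[key] = endpoint

    def query(self, key):
        """Would correspond to GET: return the endpoint, or None if unknown."""
        return self._endpoints.get(key)

    def delete(self, key):
        """Would correspond to DELETE: forget an endpoint."""
        self._endpoints.pop(key, None)
```

In such a design, a query for an endpoint that has not been registered yet returns nothing, so the requesting side can simply retry. No shared filesystem is involved.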
