pyprecice slow at setting up secondary communications

Hi,

I am using pyprecice to exchange volumetric data between a Python script and a CFD solver (the Python code sends the state vector q in conservative variables, and the CFD solver returns the right-hand side of the Navier-Stokes equations f(q)). The exchange works fine in 2D, but in 3D the step “Setting up secondary communication to coupling partner/s” is painfully slow. To debug the problem, I set up a simple toy exchange in which two Python scripts exchange volumetric data over a 3D domain: when the two solvers partition the mesh across MPI processes in exactly the same way, the overall setup is fast. However, if the partitions differ, the secondary-communications step is really slow. Could you point me in the right direction to solve this problem? I am using preCICE version 2.5.0. I have attached my precice-config file, as well as a screenshot of the preCICE log in case they’re useful.

Thanks so much!

Cheers,
Alberto
pyprecice_files.zip (943.4 KB)

Hi,

could you explain in more detail what ‘slow’ and ‘fast’ mean in your specific case? How many unknowns are you exchanging? Is it the ~40k and ~60k in the screenshot on 30 ranks? This shouldn’t be much of a problem. Are you running on a cluster? We had some speedups regarding this in preCICE version 3 (e.g. Optimize building the communication map by davidscn · Pull Request #1830 · precice/precice · GitHub). The most obvious advice before debugging would be to try preCICE version 3.

Hi David,

Thanks for your reply. Here are additional details, and apologies for not providing them sooner. The mesh has size 550 x 140 x 15 (in the x, y and z directions) and I am exchanging 5 flow variables over the whole mesh (i.e., 5.7 million degrees of freedom). In the screenshot, the processor layout was 10 x 2 x 1 for one code and 10 x 3 x 1 for the other. By fast, I mean that setting up all communications takes about 1 second. By slow, I mean that it takes several minutes. Also, yes, I am running on a cluster.

Any thoughts? Switching versions would require rebuilding all our software, so I’d rather exhaust all other possibilities within v2.5.0 before doing that. But if you think there’s no hope, then I’ll try version 3 of course.

Thank you again for your help.

Cheers,
Alberto

Hi,

You may be able to enable the two-level-initialization in your case. This feature still doesn’t cover many edge cases, but you can give it a shot.

<m2n:... use-two-level-initialization="1" />
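For context, a minimal sketch of how the full tag could look, assuming a sockets m2n and placeholder participant names (adapt the transport and names to whatever your config actually uses):

<m2n:sockets from="SolverOne" to="SolverTwo" use-two-level-initialization="1" />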

Hi,

Thanks for following up with me. I tried the two-level initialization and it made the setup slightly faster when I use a moderate number of processes (O(30) per solver). However, when using more processes (O(100) per solver), I could no longer see much of a benefit. I have attached a minimal example that shows the issue: it can be run by launching the shell script run.sh, and I have also added a short README with a brief description of the parameters that can be modified to obtain different mesh partitions. Let me know if you have time to run it and see whether I am doing anything wrong.

Thanks so much for your time and help.

Best,
Alberto

pyprecice_scripts.zip (10.7 KB)

A last resort would be to use a geometric filter, e.g. geometric-filter="on-secondary-ranks" or similar. Using no filter should yield the worst performance.
For the two-level initialization there were similar performance improvements released with version 3. If the performance (it should only concern the initialization, though) is crucial in your case and the filtering doesn’t help, I would upgrade the preCICE version.
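In case it helps, in a v2 config the filter is set on the received mesh inside the participant. A sketch with placeholder participant and mesh names (check it against your own config):

<participant name="SolverTwo">
  <use-mesh name="SolverOne-Mesh" from="SolverOne" geometric-filter="on-secondary-ranks" safety-factor="0.1" />
  ...
</participant>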

Hi,

In your example, you are using the nearest-neighbour mapping, which comes with a practically negligible runtime cost.

This leaves as cost factors:

  • filtering (which should not be a big deal if you use a release build of preCICE)
  • gathering the mesh on the primary node (which you avoid by using the two-level-initialization)
  • mesh communication (which you cannot avoid further)
  • establishing connections

The last part uses the file system as a common denominator, which can be very costly depending on how the infrastructure is set up.
If you have access to a very fast shared network storage, then you can set the exchange-directory of the m2n to it.
In general, your system admin should be able to help you.
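For reference, a sketch of how that could look on the m2n tag, with placeholder participant names and a placeholder path (point it at whatever fast shared storage you actually have):

<m2n:sockets from="SolverOne" to="SolverTwo" exchange-directory="/path/to/fast/shared/storage" />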

To really see what is going on, you can use the events2trace script after the case has finished to see where the time is spent in preCICE.

Here is a link to the old documentation.

Hi,

Thank you both for your replies. I’ve tried the two-level initialization, as well as all the possible geometric filters as David suggested, but nothing is working. I have found that the communication setup is fast if the two solvers partition the mesh as Npcx x 1 x 1 or 1 x Npcy x 1 (where Npcx and Npcy are the numbers of processors along the x and y directions). Why would a partition like Npcx x Npcy x 1 lead to such a severe increase in cost? Could this be something related to pyprecice and mpi4py? (I know folks who had wonderful experiences with version 2.5.0 and 3D meshes with highly asymmetric partitions, so I am a bit puzzled.) In any case, I am going to install version 3.0 and see if I can get an improvement.

Best,
Alberto

Another reason could be the combination of your mesh (partitioned unit-square) and the defined safety-factor="0.1".

Depending on the number of ranks, each partition can become smaller than the margin added by the safety factor, which essentially means that you have a full Npcx x Npcy communication network that needs to be established.
The filesystem can easily struggle here.
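As a quick experiment, you could reduce or remove the safety factor on the received mesh, for example (placeholder names again; note that this shrinks the margin preCICE keeps around each partition, which can matter for non-matching meshes):

<use-mesh name="SolverOne-Mesh" from="SolverOne" safety-factor="0.0" geometric-filter="on-secondary-ranks" />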

There is barely any documentation on this on our website. I opened an issue in the website repo.

As a bonus, you will get way better performance analysis tools in v3.

Best,
Frédéric

Hi Frederic,

Thanks! Where can I find a high-level description of what the algorithm is doing, so that I can get a better sense of the impact of all the flags? Also, how would you partition a 3D Cartesian domain to minimize communication?

Cheers,
Alberto

This is very unlikely.

You should be able to find the general idea in the dissertation of @uekerman

We employ AABBs (axis-aligned bounding boxes) to describe partitions and scale them using the safety factor and the longest edge. So, ideal for keeping the number of communication partners low are (matching) axis-aligned cubes or simple rectilinear grids on both participants.
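As a rough sketch (illustrative, not the exact implementation): with safety factor $s$ and longest AABB edge $\ell$, each rank’s bounding box is inflated to roughly

$$[x_{\min} - s\ell,\ x_{\max} + s\ell] \times [y_{\min} - s\ell,\ y_{\max} + s\ell] \times [z_{\min} - s\ell,\ z_{\max} + s\ell],$$

and two ranks of the two participants become communication partners whenever their inflated boxes overlap. With $s = 0.1$ and many small partitions in an Npcx x Npcy x 1 layout, most boxes end up overlapping and the connection network degenerates towards all-to-all.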

Best,
Frédéric

Follow-up regarding the provided case:

I ran your case in various configurations on my system, and the runtime is dominated by the generation of “ownership data”. This is what @DavidSCN referred to in his post, and version 3 brings significant gains here, as you can see in the last example.

Setup: preCICE v2.5.1/v3.1.1 (Release + PRECICE_RELEASE_WITH_ASSERTION, no IPO) with clang 17.0.6 and mold linker 2.30.0 on AMD 5900X with 32GB of memory, using the loopback interface. Setup run on a tmpfs (in memory). I tried different filters, 2LI, and different safety factors.

I used the event2trace script to produce the traces of the following cases and visualized them using ui.perfetto.dev.

Provided case

The case as provided by you; the only change is the network interface.

Default settings

Provided case with the default safety-factor and geometric filter (i.e., I removed the explicit settings):

trace-stock.json.txt (6.8 MB)

No safety-factor

Provided case with safety-factor of 0 and no geometric filter

Two-level initialization

Provided case with default safety-factor and geometric filter and two-level-initialization:

V3 example

I ported your case to preCICE v3 and the result looks like this.

pypyrecice_v3.zip (301.8 KB)

The entire initialization now takes ~2s, down from ~35s in v2.

trace-v3.json.txt (4.1 MB)

Note that the v3 version of the Python bindings trades some performance for safety and usability. Reading the expensive rhoVW in solver2, for example, takes 50ms in Python but only 3ms in preCICE. This inflates the time spent in the solver, which in turn inflates the time the solvers need to wait for each other.

In practice, your solve step should dwarf the read/write in terms of computational cost.
