Slow data transfer in preCICE setup

Hello preCICE-community!

I am working on coupling our in-house particle solver (XPS) with OpenFOAM using volumetric coupling. I use the OpenFOAM adapter with OpenMPI (4 ranks) and domain decomposition on the fluid side, and a single CPU thread + CUDA kernels on the DEM side. I am currently facing a situation that I don't know how to handle, so I am coming to you for advice.

My current config (momentum coupling) looks something like this:


precice-config.xml (4.3 KB)

As you can see, the number of variables I send is limited, and the case I am currently working on is not large either. I am using the same mesh dimensions on both sides (174,000 3D cells). The DEM timestep is 1e-5 s; the fluid timestep and coupling timestep are both 1e-3 s. I exchange the following quantities:

  <data:vector name="Velocity" />
  <data:vector name="PressureGradient" />
  <data:vector name="ViscousForce" />
  <data:scalar name="FluidDensity" />
  <data:scalar name="FluidViscosity" />

  <data:vector name="MomentumSource" />
  <data:vector name="ParticleVelocity" />
  <data:scalar name="VoidFraction" />

making 5 vectors and 3 scalars. Per coupling timestep that is

  174,000 cells * 8 bytes ≈ 1.33 MiB per scalar
  174,000 cells * 8 bytes * 3 ≈ 3.98 MiB per vector

making in total less than 24 MiB per coupling timestep.

According to my trace (recorded using the preCICE profiling tool), reading a single scalar (on 174,000 cells) takes ~33 ms and a single vector (on 174,000 cells) takes ~35 ms.

Nearest-neighbour mapping should be trivial, but it also takes 33-39 ms depending on constraint and data type (scalar or vector).

The marked area in the Perfetto plot corresponds to about one coupling interval. The top track is the DEM solver with 100 timesteps and the bottom one is the MPI fluid solver with 1 timestep.

This test was run on a workstation with an AMD Ryzen 9 7900X 12-core processor and an Nvidia RTX 4060 Ti GPU.

These times seem a little out of place and would correspond to ~40 MB/s. What am I doing wrong to get such performance? Are these times for reading/writing/mapping realistic, given the amount of data I am using?
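As a sanity check, the arithmetic above can be verified quickly (a minimal sketch assuming 8-byte doubles; the 33 ms figure is the measured scalar read time from my trace):

```python
# Back-of-the-envelope check of the per-coupling-step data volume
# and the effective throughput implied by the measured read times.
CELLS = 174_000
BYTES_PER_DOUBLE = 8

scalar_bytes = CELLS * BYTES_PER_DOUBLE          # one scalar field
vector_bytes = scalar_bytes * 3                  # one 3D vector field
total_bytes = 5 * vector_bytes + 3 * scalar_bytes

print(f"scalar: {scalar_bytes / 2**20:.2f} MiB")                  # 1.33 MiB
print(f"vector: {vector_bytes / 2**20:.2f} MiB")                  # 3.98 MiB
print(f"total per coupling step: {total_bytes / 2**20:.1f} MiB")  # 23.9 MiB

# ~33 ms to read one scalar field corresponds to roughly:
throughput = scalar_bytes / 0.033 / 1e6
print(f"effective throughput: {throughput:.0f} MB/s")             # 42 MB/s
```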

If these times turn out to be expected, I will need to think about improving the performance of my setup (introducing MPI in the DEM solver, tuning cell sizes, data frequency, …).

Thank you very much in advance for reading through my post and helping out.

Best regards, Andreas

After reading through this paper: 10.12688/openreseurope.14445.2, it seems like the big bottleneck I am facing is the single-threaded DEM solver. In this paper you describe the time for nearest-neighbor mapping as negligible (Figure 9) and the initial setup as inexpensive (<100 ms).

The hardware setup used in this paper consists of “two 3.1GHz Intel Xeon Platinum 8174 (SkyLake) processors with a total of 48 cores and 96GB of system memory per node” and 48 and 96 MPI ranks - so a little bit above my specs :sweat_smile:

Using 4 MPI ranks for reading consistent data on the fluid solver already reduces memory access time for my setup by a factor of about 4.

I will probably continue by implementing a memory pipeline for concurrent memory access and transfer, and can share my results in this thread if there is interest :slight_smile:

If anyone has experience with single-threaded memory access in preCICE or used a similar setup - please feel free to contact me or answer below this post.

Thanks in advance and best regards,
Andreas

Hi, nice to hear from you after the Workshop!

Let’s start with the fundamentals: which preCICE version are you using, and how did you build it?

Are you exchanging data on the cell centers or the cell vertices? How many vertices do you define in preCICE using setMeshVertices()?

Mapping data is essentially a sequential write on one block of data and random reads on another block of data. This is generally so fast that we consider it not worth looking into.
I once had the idea to add a contiguous mapping, which copies whole blocks, but we didn’t see the need to advance this further, as direct mesh access can be heavily leveraged for this. See Add Contiguous Mapping · Issue #489 · precice/precice · GitHub
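For intuition: once the nearest neighbours have been found (a one-time setup cost), applying an NN mapping is just an index gather. A hedged pure-Python sketch, not preCICE code; `nearest_idx` stands in for the index table built during initialization:

```python
def apply_nn_mapping(src, nearest_idx):
    """Apply a precomputed nearest-neighbor mapping:
    sequential writes to the output, random reads from src."""
    return [src[j] for j in nearest_idx]

# Toy example: 1D source values at x = 0, 1, 2, 3,
# mapped to two target points nearest to x = 0.1 and x = 2.9.
src_values = [10.0, 11.0, 12.0, 13.0]
nearest_idx = [0, 3]   # precomputed once during initialization
print(apply_nn_mapping(src_values, nearest_idx))  # [10.0, 13.0]
```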

That said, I am currently comparing nearest-projection (NP) performance across versions and tweaked the develop version. I can do the same for nearest-neighbor (NN).

This highlighted section looks very suspicious. It looks like the solver performs very fast time steps followed by a long time step at the end, where the read- and write-data functions blow up the runtime.
Are you reading and writing data in all time steps of the solver?

Also note that using synchronize="true" in the profiling configuration generally blows up the overall time, as it introduces many synchronization points/barriers. It is mainly useful for profiling runtime inside a single parallel solver.
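For reference, this is set at the top level of the preCICE configuration; something like the following (double-check the exact mode values against the documentation of your preCICE version):

```xml
<precice-configuration>
  <profiling mode="all" synchronize="true" />
  <!-- ... rest of the configuration ... -->
</precice-configuration>
```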

Reading will not interpolate in your scenario, so it should take as long as writing data.

The reference paper is getting a bit outdated now. The most up-to-date publication regarding mapping performance is from @DavidSCN: https://epubs.siam.org/doi/pdf/10.1137/24M1663843.
Section 4 contains the experiments, with the meshes on page 13 and the runtime comparison in Fig. 16 on page 21.

Do you mean using preCICE in a multithreaded environment?
preCICE is not thread-safe, if you want to perform thread-local reads and writes to/from preCICE, you need to lock the participant object in some way.
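To illustrate what "locking the participant object" could look like, here is a hedged Python sketch; `FakeParticipant` is a stand-in for the real, non-thread-safe participant, and only the lock pattern matters, not the API details:

```python
import threading

class GuardedParticipant:
    """Hypothetical wrapper serializing all calls into a
    non-thread-safe participant-like object behind one lock."""
    def __init__(self, participant):
        self._p = participant
        self._lock = threading.Lock()

    def write_data(self, mesh, name, ids, values):
        with self._lock:  # only one thread talks to the participant at a time
            self._p.write_data(mesh, name, ids, values)

# Stand-in for the real participant (NOT the preCICE API):
class FakeParticipant:
    def __init__(self):
        self.calls = []
    def write_data(self, mesh, name, ids, values):
        self.calls.append((mesh, name, tuple(ids), tuple(values)))

guarded = GuardedParticipant(FakeParticipant())
threads = [threading.Thread(target=guarded.write_data,
                            args=("DEMGrid", "VoidFraction", [i], [0.5]))
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(guarded._p.calls))  # 4: all writes arrive, serialized
```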
Thread-safe data access could be an interesting feature, but we haven’t investigated this so far.

Opened Thread-safe Participant · Issue #2407 · precice/precice · GitHub regarding this.

Hope that helps!
Best regards
Frédéric

Hi, I was just a little busy, but am now fully focused on developing our application using the preCICE library. Reading the forum guidelines before making a post would also have been nice - sorry for that! :see_no_evil_monkey:

I am currently using origin/release-v3.2.0, which I built from source.

Both sides (OpenFOAM and DEM) write data on cell centers. On the DEM side we call participant->setMeshVertices() twice (extensive and intensive grid) with a precice::span<precice::VertexID> that contains 174,000 vertices.

Using NP would probably also be fine for us, since adding mesh connectivity would not be a big deal on our side. I can run a benchmark using this mapping scheme instead and post the results here.

In the posted test we exchange data once every 100 DEM timesteps. Our DEM and momentum solver run in parallel on the GPU, and updating all particles takes ~800 µs per timestep in that case. The last/first timestep is where we read/write data to the fluid solver via preCICE, and for this we need to copy data from VRAM to RAM (those should be the small sections between the “readData” and “writeData” blocks).
The DEM timestep is 1e-5 s and the coupling timestep is 1e-3 s in that case.
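The cadence described above can be sketched as a subcycling loop (hypothetical structure; `reads`/`writes` stand in for the actual preCICE readData/writeData calls plus the GPU copies):

```python
DT_DEM = 1e-5        # DEM timestep
DT_COUPLING = 1e-3   # coupling window
SUBSTEPS = round(DT_COUPLING / DT_DEM)   # 100 DEM steps per coupling step

reads = writes = dem_steps = 0
for coupling_step in range(3):           # simulate 3 coupling windows
    reads += 1                           # read fluid data once (RAM -> VRAM)
    for _ in range(SUBSTEPS):
        dem_steps += 1                   # pure GPU work, no preCICE calls
    writes += 1                          # write particle data once (VRAM -> RAM)

print(SUBSTEPS, dem_steps, reads, writes)  # 100 300 3 3
```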

I am currently not using any interpolation scheme, just nearest-neighbor mapping, so yes. I expect that adding an interpolation scheme would increase the mapping time drastically, though.

Thanks for the paper - I’ll definitely give it a read!

I had some thoughts about this and will try to integrate MPI into our solver to utilize the m2n communication of preCICE. We realized that reading on the fluid side with 4 MPI ranks already gives a good performance boost (it basically divides the time required for a single readData by the number of ranks). For this we need to develop a memory pipeline where multiple MPI ranks write to the same GPU buffer on the DEM side.

Using regular threads for this could also be interesting, because the problem with MPI will probably be the following for us: when utilizing MPI ranks only for read/write and the CPU ↔ GPU copies, all but one rank will be idle the whole time we “just” compute DEM timesteps without exchanging data. I should mention that I have no prior experience with MPI, so this might not be a problem at all.

Thanks for your insights and best regards,
Andreas

In this case, I recommend a production build using the production preset:

cmake --preset=production -DBUILD_TESTING=OFF && make -j 24 -C build

The production-audit preset enables debug messages and assertions, which can be practical when developing.

I would expect an NN mapping of scalar data to take around 10 µs for this input size.

Depending on your cluster and the amount of work you are willing to invest, a hybrid scheme could also make sense: one MPI process per node, with multi-threading within it. AFAIK one MPI process per GPU is also a common strategy.
Note, however, that preCICE is single-threaded by design.

Oh thanks, I didn’t know that!

But then how do you explain the average mapping time of 36 ms I get per readData() call, such as map.nn.mapData.FromFluidMeshToDEMGrid_Intensive? I guess if the actual mapping is not taking long, then something must be blocking the thread I am running my DEM solver on.

Because something seems to be seriously off here:
the average mapping job takes ~268 ms for DEM->CFD and ~388 ms for CFD->DEM.

I tried your input size, and I am getting around 0.42 ms (median of 44 samples) for mapping scalar data of size 173,889. I’m using GCC 15.2.0 and mold 2.37.1.
Not as fast as I expected, but dramatically faster than in your case.

Furthermore, your CPU runs at a 4.7 GHz base clock and boosts up to 5.6 GHz. My cores are fixed to 4 GHz, so your timings should be significantly lower.

Can you double-check that preCICE is built in Release mode, ideally with interprocedural optimization/LTO? The production preset enables this.
Also, what compiler are you using at which version?

Thank you for taking the time to benchmark the mapping configuration I am using!

I now used the cmake --preset=production build, and this improved the performance of mapping, reading, and writing by a factor of ~100. Thank you very much for suggesting that - I was not aware that I had been using a debug build until now.

I see now that the using-presets section in your documentation clearly states to use the production build, but I must have overlooked that when first installing preCICE.

Benchmarks

This breaks down to the following average times:

  Operation             Avg. duration (ms)   Block count
  Mapping               0.273                30,280
  Reading               0.197                18,528
  Writing (extensive)   0.508                 1,000
  Writing (intensive)   0.345                 3,000

Thank you again very much for taking the time to analyze the problem - I appreciate it very much.

Best regards,
Andreas


Happy to help!

These times look way more reasonable.

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.