Simulation stuck at advance for both solvers

Hello everyone!

I have recently migrated my coupling from preCICE version 1.6.1 to 2.2.0. I'm using the Python bindings, and my two coupled solvers are the CSD solver CAMRAD II (which runs on a single process) and the CFD solver TAU (which runs on multiple processes that communicate via OpenMPI). This is my preCICE config file:

precice-config.xml (21.3 KB)

After the migration, when I start my simulation, it seems to get stuck after one timestep on the CAMRAD side and one timestep on the TAU side, with both solvers having called advance(). Here is my output log for CAMRAD:

camrad_log.txt (4.9 KB)

And my output log for TAU, for one of the slave ranks, at the end:

tau_log.txt (58.4 KB)

To me it looks like an issue related to OpenMPI, since my slave rank hangs while trying to pass data to the master rank. What could I have done wrong to cause this behavior? I compiled preCICE with the same version of OpenMPI that TAU was compiled with.

Any help would be much appreciated. Thank you a lot!


I have experienced similar issues previously with OpenFOAM, and this was the cause: I had compiled preCICE with Spack, but with a different MPI version than OpenFOAM. Is it possible that the compiler picks up another MPI version while compiling, or that you are using a different MPI version to start the simulation?

Note that in your precice-config.xml you have:

<m2n:sockets from="TAU" to="CAMRAD" enforce-gather-scatter="1" exchange-directory="../" network="ib0"/>

The network="ib0" means “Infiniband”, which is needed on SuperMUC. On another system that does not use Infiniband, you can/should skip the network attribute (see the documentation). Where are you running this?
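For example, the same element without the network attribute would simply look like this:

<m2n:sockets from="TAU" to="CAMRAD" enforce-gather-scatter="1" exchange-directory="../"/>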


To me it sounds like (unknowing/unintentional) mixing of MPI versions, as mentioned by Makis, is quite likely. This often leads to weird behavior like hanging without any useful errors.

I ran into a similar problem using a FEniCS solver in Python. It basically boiled down to mpi4py being imported explicitly by me or by some package (FEniCS) that already uses MPI.

In my case, importing mpi4py before or after FEniCS, or just dropping it from my file, gave completely different behaviors of my code. Depending on the import order and which modules I loaded in my Python file, different MPI libraries were loaded and/or different instances of MPI were created at runtime. I never debugged the details.

  1. Check your mpi4py library (or libraries, in case more than one is installed) and which MPI version they are linked against (see the sketch after this list).
  2. Check which MPI version your solver is linked against.
  3. Check which MPI version your preCICE installation is linked against.
  4. Check if you explicitly import mpi4py in your Python code; if so, try commenting it out and use the MPI interface that your solver package (TAU) is already using.
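For the first point, here is a minimal sketch of how you could query which MPI your mpi4py was built against and which one it loads at runtime (the ldd paths in the comments are just placeholders, not your actual ones):

# check_mpi.py -- run with: mpirun -n 2 python3 check_mpi.py
import mpi4py
print("mpi4py package:", mpi4py.__file__)
print("build config:  ", mpi4py.get_config())  # mpicc/mpicxx used to build mpi4py

from mpi4py import MPI
print("runtime MPI:   ", MPI.Get_library_version().strip())
print("rank/size:     ", MPI.COMM_WORLD.Get_rank(), "/", MPI.COMM_WORLD.Get_size())

# For the second and third point, something like
#   ldd /path/to/libprecice.so | grep -i mpi
#   ldd /path/to/your_tau_executable | grep -i mpi
# (placeholder paths) shows which libmpi each binary is linked against.

If the build config, the runtime library version, and the libraries that TAU and preCICE link against do not all point to the same OpenMPI installation, that could explain the hanging.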

Thank you for your suggestions!

I have recompiled OpenMPI and preCICE and made sure that my solver and preCICE are compiled with the same version of OpenMPI and are also running the same version during the simulation. As far as I can see, nothing has changed.

I am running my simulation on a university cluster. Honestly, I don't know exactly which network standard the cluster uses, but this worked before the migration, and removing the network attribute didn't fix the problem either.

Thank you for your reply!

It is really interesting that importing mpi4py gives completely different behaviours. I will look into it today and come back to you with results!

@Makis Thank you for replying!

@Beini_Ma, the cluster uses Infiniband.

This could be related to the gcc compiler used to compile mpi4py, TAU, and preCICE.
I remember having a similar problem.
Check the file mpi.cfg before compiling: check the paths for mpicc and mpicxx.
@Beini_Ma, could you please check my configuration for mpi4py? Maybe it can help.
You will find my mpi.cfg in …/00_Software/mpi4py/mpi.cfg
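For reference, the relevant entries in such an mpi.cfg usually look roughly like this (the mpi_dir below is just a placeholder, not my actual path):

[openmpi]
mpi_dir = /opt/openmpi
mpicc   = %(mpi_dir)s/bin/mpicc
mpicxx  = %(mpi_dir)s/bin/mpicxx

The important thing is that mpicc and mpicxx point to the same OpenMPI installation that TAU and preCICE were built with.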

Sorry for the late update, that is the current state right now:

I removed mpi4py completely from my code and uninstalled the module and the problem still didn’t disappear.

When running the simulation in serial, the problem does not occur, which makes me think that the problem is related to OpenMPI - maybe I made a mistake when installing preCICE.

Could you check which mpi4py is left? I think it is a dependency of the preCICE Python bindings. If you still use the bindings, some mpi4py is installed somewhere.
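A quick way to check which one your Python actually picks up is something like:

python3 -c "import mpi4py; print(mpi4py.__version__, mpi4py.__file__)"

which prints the version and the location of the mpi4py module that gets imported.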

Another idea that came to my mind is the directory that you use as exchange directory.

  • Can all ranks access the directory?
  • Could you delete the precice-run directory in there in case it still exists?
  • Is the exchange directory a network drive that is slow? On some machines I have the problem that initializing preCICE takes 10-15 minutes because the shared drive is really slow.

I tried using mpi4py versions 3.0.0 and 3.0.3, and also removing mpi4py completely (even as an import in the Python bindings), which was a suggestion that Benjamin Rodenberg made. About your other ideas:

  • All ranks can access the directory.
  • I have tried manually removing the precice-run directory when restarting the simulation. Actually, I try to do that most of the time if I don't forget.
  • As far as I know the network drive is pretty fast, at least I haven't had issues with it before. I ran the simulation before the migration on the same setup.

That is weird. Did you try to break your problem down to some smaller test code that only contains preCICE, Python and MPI? This might at least help to ensure that your installed preCICE version works.

You could check whether any of the tutorials supports parallel runs, uses Python, and is easy enough for a quick test. If all else fails, I could dig through my codes; I might have a dummy test code that uses preCICE, Python and MPI.
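Such a dummy could look roughly like this (a sketch using the preCICE 2.x Python API; the participant, mesh and data names are made up and would need a matching precice-config.xml plus a second, serial participant on the other side):

# dummy_parallel_participant.py -- run with: mpirun -n 4 python3 dummy_parallel_participant.py
import numpy as np
import precice
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# "Fluid", "Fluid-Mesh" and "Pressure" are placeholder names from a hypothetical config
interface = precice.Interface("Fluid", "precice-config.xml", rank, size)
mesh_id = interface.get_mesh_id("Fluid-Mesh")
data_id = interface.get_data_id("Pressure", mesh_id)

# each rank registers a small, disjoint strip of vertices
n = 5
coords = np.zeros((n, interface.get_dimensions()))
coords[:, 0] = rank + np.linspace(0.0, 0.9, n)
vertex_ids = interface.set_mesh_vertices(mesh_id, coords)

dt = interface.initialize()
while interface.is_coupling_ongoing():
    interface.write_block_scalar_data(data_id, vertex_ids, np.full(n, float(rank)))
    dt = interface.advance(dt)
interface.finalize()

If this hangs in initialize() or advance() in the same way, the problem is in the preCICE/MPI installation; if it runs through, the issue is more likely in the TAU adapter.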

I haven’t but that is a good suggestion and I’ll give it a try.

Since my last post, I've tried different things without success: I've reinstalled a PETSc version that was working in the pre-migration code, and I've tried to use a PETSc configuration that had worked earlier. I still think that my issue is linked to OpenMPI, but I'm running out of ideas of what to try. Right now I'm trying to install preCICE without PETSc and OpenMPI and see how that goes.

Installing preCICE without PETSc did not really help.
@Beini_Ma, could you please upload your log files and share with us the steps that you followed for the installation?

Hello everyone,

I am really grateful to all of you for helping me! Just a quick update from me: I tried installing preCICE without PETSc and it didn't change the outcome.

I have tried running the simulation with different numbers of processes on the TAU side. I uploaded the logs of my most recent run with 4 processes on the TAU side. TAU tries to pass data to CAMRAD and passes 51 data sets to preCICE, but on the CAMRAD side I only receive 24 data sets, exactly the 24 data sets that were calculated by rank 0.

camrad_log.txt (754.4 KB)
https://file.io/nhjdGlei5bVV (upload of my file because it was too large for the forum)

I'm actually quite puzzled by this. As far as I understood, preCICE correctly received the data sets from my TAU adapter. I am using a gather-scatter distribution scheme, so all my data sets should have been passed via rank 0. Why does only the data from my rank 0 arrive on my CAMRAD adapter side?

Thank you again,
Beini

@Beini_Ma you could also export the meshes that preCICE uses and see if it actually receives the complete data. If yes, then the problem should only be in the Python bindings (as we currently expect) or the adapter. It may also help to understand a bit more what is going on.
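For example, adding an export tag like the following inside each <participant> in your precice-config.xml writes the coupling meshes and data as VTK files (the directory name is just an example):

<export:vtk directory="precice-exports"/>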

Actually, my meshes on TAU side are being exported. Does this mean something is wrong on the TAU side of the adapter?

Do you mean that you can see (updated) data on both the TAU and CAMRAD II sides in the exported VTK files? Do you see the written data but maybe not the read data?

Oh sorry, my mistake. My meshes on TAU side are not being exported. That probably means that my TAU side has never finished passing all the meshes to preCICE?

Hi @Beini_Ma,

any new observations? Could you maybe make a simple schematic picture (even just on a piece of paper) of how many processes/partitions each solver has, which processes export which meshes, and which processes are hanging when?

I still believe the issue is in a combination of TAU + Python bindings + mpi4py + the MPI you are using, but such a figure could help understand the situation.

Hello!

I constructed a dummy example similar to my coupling with a serial first solver and a parallel second solver and it worked with my installation of preCICE.

When trying to answer your question, I’ve actually realized something in my log files. My data that gets sent from TAU to CAMRAD is split like this:

What I've realized is that my log files say that all 51 data sets have been written on the TAU side, which would mean 4200 data points for each Moment and Forces data set and 25 data points for my Azimuths data. But in the logs I can only find 26 data sets that were sent with 4200 data points, precisely the data sets which were received on the CAMRAD side. So for some reason, it seems like my TAU adapter thinks that it has written all 51 data sets, but some are still missing. Do you have an idea what could have caused this?

Thank you for your help!

Beini