Simulation stuck at advance for both solvers

Hello everyone!

I have recently migrated my coupling from preCICE version 1.6.1 to 2.2.0. I'm using the Python bindings, and my two coupled solvers are the CSD solver CAMRAD II (which runs on a single process) and the CFD solver TAU (which runs on multiple processes that communicate via OpenMPI). This is my preCICE config file:

precice-config.xml (21.3 KB)

After the migration, when I start my simulation, it seems to get stuck after one timestep on the CAMRAD side and one timestep on the TAU side, with both solvers having called advance(). Here is my output log for CAMRAD:

camrad_log.txt (4.9 KB)

And my output log for TAU, for one of the slave ranks, at the end:

tau_log.txt (58.4 KB)

To me it looks like an issue related to OpenMPI, since my slave rank hangs while trying to pass data to the master rank. What could I have done wrong to cause this behavior? I compiled preCICE with the same version of OpenMPI that TAU was compiled with.

Any help would be much appreciated. Thank you a lot!


I have experienced similar issues previously with OpenFOAM, and this was the cause: I had compiled preCICE with Spack, but with a different MPI version than OpenFOAM. Is it possible that the compiler picks up another MPI version while compiling, or that you are using a different MPI version to start the simulation?

Note that in your precice-config.xml you have:

<m2n:sockets from="TAU" to="CAMRAD" enforce-gather-scatter="1" exchange-directory="../" network="ib0"/>

The network="ib0" means “Infiniband”, which is needed on SuperMUC. On another system that does not use Infiniband, you can/should skip the network attribute (see the documentation). Where are you running this?
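For example, the same element without the network attribute would simply look like this:

<m2n:sockets from="TAU" to="CAMRAD" enforce-gather-scatter="1" exchange-directory="../"/>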


To me it sounds like (unknowing/unintentional) mixing of MPI versions, as mentioned by Makis, is quite likely. This often leads to weird behavior like hanging without any useful errors.

I ran into a similar problem using a FEniCS solver in Python. It basically boiled down to mpi4py being imported explicitly by me or by some package (FEniCS) that already uses MPI.

In my case, importing mpi4py before or after FEniCS, or just dropping it from my file, gave completely different behaviors of my code. Depending on the import order and which modules I loaded in my Python file, different MPI libraries were loaded and/or different instances of MPI were created at runtime. I never debugged the details.

  1. Check your mpi4py library (or libraries, in case more than one is installed) and which MPI version they are linked against (see the sketch after this list).
  2. Check which MPI version your solver is linked against.
  3. Check which MPI version your preCICE installation is linked against.
  4. Check if you explicitly import mpi4py in your Python code; if so, try commenting it out and use the MPI interface that your solver package (TAU) is already using.
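For the first point, here is a minimal sketch of how you could query which MPI your mpi4py was built against and which one it loads at runtime (the ldd paths in the comments are just placeholders, not your actual ones):

# check_mpi.py -- run with: mpirun -n 2 python3 check_mpi.py
import mpi4py
print("mpi4py package:", mpi4py.__file__)
print("build config:  ", mpi4py.get_config())  # mpicc/mpicxx used to build mpi4py

from mpi4py import MPI
print("runtime MPI:   ", MPI.Get_library_version().strip())
print("rank/size:     ", MPI.COMM_WORLD.Get_rank(), "/", MPI.COMM_WORLD.Get_size())

# For the second and third point, something like
#   ldd /path/to/libprecice.so | grep -i mpi
#   ldd /path/to/your_tau_executable | grep -i mpi
# (placeholder paths) shows which libmpi each binary is linked against.

If the build config, the runtime library version, and the libraries that TAU and preCICE link against do not all point to the same OpenMPI installation, that could explain the hanging.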

Thank you for your suggestions!

I have recompiled OpenMPI and preCICE and made sure that my solver and preCICE are compiled with the same version of OpenMPI and are also running the same version during the simulation. As far as I can see, nothing has changed.

I am running my simulation on a university cluster. Honestly, I don't know exactly which network standard the cluster uses, but this worked before the migration, and removing the network attribute didn't fix the problem either.

Thank you for your reply!

It is really interesting that importing mpi4py gives completely different behaviours. I will look into it today and come back to you with results!

@Makis Thank you for replying!

@Beini_Ma, the cluster uses Infiniband.

This could be related to the gcc compiler used to compile mpi4py, TAU, and preCICE.
I remember having a similar problem.
Check the file mpi.cfg before compiling: check the paths for mpicc and mpicxx.
@Beini_Ma, could you please check my configuration for mpi4py? Maybe it can help.
You will find my mpi.cfg in …/00_Software/mpi4py/mpi.cfg
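For reference, the relevant entries in such an mpi.cfg usually look roughly like this (the mpi_dir below is just a placeholder, not my actual path):

[openmpi]
mpi_dir = /opt/openmpi
mpicc   = %(mpi_dir)s/bin/mpicc
mpicxx  = %(mpi_dir)s/bin/mpicxx

The important thing is that mpicc and mpicxx point to the same OpenMPI installation that TAU and preCICE were built with.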

Sorry for the late update, that is the current state right now:

I removed mpi4py completely from my code and uninstalled the module and the problem still didn’t disappear.

When running the simulation in serial, the problem does not occur, which makes me think that the problem is related to OpenMPI - maybe I made a mistake when installing preCICE.

Could you check which mpi4py is left? I think it is a dependency of the preCICE Python bindings. If you still use the bindings, some mpi4py is installed somewhere.
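A quick way to check which one your Python actually picks up is something like:

python3 -c "import mpi4py; print(mpi4py.__version__, mpi4py.__file__)"

which prints the version and the location of the mpi4py module that gets imported.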

Another idea that came to my mind is the directory that you use as exchange directory.

  • Can all ranks access the directory?
  • Could you delete the precice-run directory in there in case it still exists?
  • Is the exchange directory a network drive that is slow? On some machines I have the problem that initializing preCICE takes 10-15 minutes because the shared drive is really slow.

I tried using mpi4py versions 3.0.0 and 3.0.3, and also removing mpi4py completely (even as an import in the Python bindings), which was a suggestion that Benjamin Rodenberg made. About your other ideas:

  • All ranks can access the directory.
  • I have tried manually removing the precice-run directory when restarting the simulation. Actually, I try to do that most of the time if I don't forget.
  • As far as I know the network drive is pretty fast, at least I haven't had issues with it before. I ran the simulation before the migration on the same setup.

That is weird. Did you try to break your problem down to some smaller test code that only contains preCICE, Python and MPI? This might at least help to ensure that your installed preCICE version works.

You could check whether any of the tutorials supports parallel runs, uses Python, and is easy enough for a quick test. If all else fails, I could dig through my codes; I might have a dummy test code that uses preCICE, Python and MPI.
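Such a dummy could look roughly like this (a sketch using the preCICE 2.x Python API; the participant, mesh and data names are made up and would need a matching precice-config.xml plus a second, serial participant on the other side):

# dummy_parallel_participant.py -- run with: mpirun -n 4 python3 dummy_parallel_participant.py
import numpy as np
import precice
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# "Fluid", "Fluid-Mesh" and "Pressure" are placeholder names from a hypothetical config
interface = precice.Interface("Fluid", "precice-config.xml", rank, size)
mesh_id = interface.get_mesh_id("Fluid-Mesh")
data_id = interface.get_data_id("Pressure", mesh_id)

# each rank registers a small, disjoint strip of vertices
n = 5
coords = np.zeros((n, interface.get_dimensions()))
coords[:, 0] = rank + np.linspace(0.0, 0.9, n)
vertex_ids = interface.set_mesh_vertices(mesh_id, coords)

dt = interface.initialize()
while interface.is_coupling_ongoing():
    interface.write_block_scalar_data(data_id, vertex_ids, np.full(n, float(rank)))
    dt = interface.advance(dt)
interface.finalize()

If this hangs in initialize() or advance() in the same way, the problem is in the preCICE/MPI installation; if it runs through, the issue is more likely in the TAU adapter.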

I haven’t but that is a good suggestion and I’ll give it a try.

Since my last post, I've tried different things without success: I've reinstalled a PETSc version that was working in the pre-migration code, and I've tried to use a PETSc configuration that had worked earlier. I still think that my issue is linked to OpenMPI, but I'm running out of ideas of what to try. Right now I'm trying to install preCICE without PETSc and OpenMPI and see how that goes.

Installing preCICE without PETSc did not really help.
@Beini_Ma, could you please upload your log files and share with us the steps that you followed for the installation?

Hello everyone,

I am really grateful to all of you for helping me! Just a quick update from me: I tried installing preCICE without PETSc and it didn't change the outcome.

I have tried running the simulation with different numbers of processes on the TAU side. I uploaded the logs of my most recent run with 4 processes on the TAU side. TAU tries to pass data to CAMRAD and passes 51 data sets to preCICE, but on the CAMRAD side I only receive 24 data sets, exactly the 24 data sets that were calculated by rank 0.

camrad_log.txt (754.4 KB)
https://file.io/nhjdGlei5bVV (upload of my file because it was too large for the forum)

I'm actually quite puzzled by this. As far as I understood, preCICE correctly received the data sets from my TAU adapter. I am using a gather-scatter distribution scheme, so all my data sets should have been passed via rank 0. Why does only the data from my rank 0 arrive on my CAMRAD adapter side?

Thank you again,
Beini

@Beini_Ma you could also export the meshes that preCICE uses and see if it actually receives the complete data. If yes, then the problem should only be in the Python bindings (as we currently expect) or the adapter. It may also help to understand a bit more what is going on.
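For example, adding an export tag like the following inside each <participant> in your precice-config.xml writes the coupling meshes and data as VTK files (the directory name is just an example):

<export:vtk directory="precice-exports"/>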

Actually, my meshes on TAU side are being exported. Does this mean something is wrong on the TAU side of the adapter?

Do you mean that you can see (updated) data on both the TAU and CAMRAD II sides in the exported VTK files? Do you see the written data but maybe not the read data?

Oh sorry, my mistake. My meshes on TAU side are not being exported. That probably means that my TAU side has never finished passing all the meshes to preCICE?

Hi @Beini_Ma,

any new observations? Could you maybe make a simple schematic picture (even just on a piece of paper) of how many processes/partitions each solver has, which processes export which meshes, and which processes are hanging when?

I still believe the issue is in a combination of TAU + Python bindings + mpi4py + the MPI you are using, but such a figure could help understand the situation.

Hello!

I constructed a dummy example similar to my coupling with a serial first solver and a parallel second solver and it worked with my installation of preCICE.

When trying to answer your question, I’ve actually realized something in my log files. My data that gets sent from TAU to CAMRAD is split like this:

What I've realized is that my log files say that all 51 data sets have been written on the TAU side, which would mean 4200 data points for each Moment and Forces data set and 25 data points for my Azimuths data. But in the logs I can only find 26 data sets that were sent with 4200 data points, precisely the data sets which were received on the CAMRAD side. So for some reason, it seems like my TAU adapter thinks that it has written all 51 data sets, but some are still missing. Do you have an idea what could have caused this?

Thank you for your help!

Beini