Simulation stuck at advance for both solvers

So, does this mean that, for some reason, the other side is still waiting for more data because the adapter does not write everything that is needed? Normally this should not happen, as the buffers should already be initialized at the beginning. But is it actually hanging at the beginning?

CAMRAD seems to be waiting since not all data sets are sent. On the TAU side, it looks like the problem is not that the data for the missing 24 data sets is incomplete; it has not been sent at all. But I'm also getting this log line:

(0) 22:24:31 [cplscheme::BaseCouplingScheme]:83 in sendData: Number of sent data sets = 51

Does this not mean that TAU thinks it has sent all the data sets? At the same time, only the sending of the 26 + 1 data sets (which were successfully received on the CAMRAD side) has been logged. What could have happened here?

Hi @Beini_Ma,

I am still puzzled about what is going on, so I set up some solver dummies for you to rule out any side effects coming from your solvers. You can find them here:

This includes a solver dummy for C++ and one for Python that you can combine with each other as explained in the README. Could you try the three setups explained in the README and check whether they work for you? The dummies do not do anything interesting, but it would be important to see whether they also hang.
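
For reference, the call sequence the dummies run through is roughly the following. This is only a minimal Python sketch of the idea, assuming the preCICE v2 Python bindings, mpi4py for the rank/size, and the usual SolverOne/MeshOne naming of the solver dummies; the actual files in the repository differ in the details:

import numpy as np
import precice
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Participant, mesh and config-file names are placeholders following the
# usual solverdummy convention.
interface = precice.Interface("SolverOne", "precice-config.xml", rank, size)
mesh_id = interface.get_mesh_id("MeshOne")

n_vertices = 3  # number of vertices defined by this rank
dims = interface.get_dimensions()
vertices = np.zeros((n_vertices, dims))
vertices[:, 0] = np.arange(n_vertices)  # dummy positions along x
# the returned vertex IDs would be used for reading/writing data
vertex_ids = interface.set_mesh_vertices(mesh_id, vertices)

dt = interface.initialize()
while interface.is_coupling_ongoing():
    if interface.is_action_required(precice.action_write_iteration_checkpoint()):
        print("DUMMY ({}): Writing iteration checkpoint".format(rank))
        interface.mark_action_fulfilled(precice.action_write_iteration_checkpoint())

    print("DUMMY ({}): Advancing in time".format(rank))
    dt = interface.advance(dt)

    if interface.is_action_required(precice.action_read_iteration_checkpoint()):
        print("DUMMY ({}): Reading iteration checkpoint".format(rank))
        interface.mark_action_fulfilled(precice.action_read_iteration_checkpoint())

interface.finalize()

The C++ dummy goes through essentially the same sequence of calls via the C++ API.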


That would be an excellent debugging idea! Thank you for making these examples, @ajaust! :hugs:

Good morning!

First of all, thank you so much for your dummy example!

I just ran it, but first I had to adjust the Python code a bit so that it runs with Python 2 (just changing some strings to unicode). For the Python-Python coupling, it runs successfully if I increase the number of vertices so that every rank is guaranteed to have a non-empty mesh. If I run the dummy with only 1 vertex in the mesh and in parallel (for a certain solver), then I get this error:

python: /scratch/ge69puj/00_Software/precice/dependencies/eigen-3.3.7/Eigen/src/Core/DenseCoeffsBase.h:180: Eigen::DenseCoeffsBase<Derived, 0>::CoeffReturnType Eigen::DenseCoeffsBase<Derived, 0>::operator()(Eigen::Index) const [with Derived = Eigen::Matrix<double, -1, 1>; Eigen::DenseCoeffsBase<Derived, 0>::CoeffReturnType = const double&; Eigen::Index = long int]: Assertion `index >= 0 && index < size()' failed.
[TULRHST-HT140:23475] *** Process received signal ***
[TULRHST-HT140:23475] Signal: Aborted (6)
[TULRHST-HT140:23475] Signal code:  (-6)
[TULRHST-HT140:23475] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f16b9e53390]
[TULRHST-HT140:23475] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f16b939d428]
[TULRHST-HT140:23475] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f16b939f02a]
[TULRHST-HT140:23475] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2dbd7)[0x7f16b9395bd7]
[TULRHST-HT140:23475] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2dc82)[0x7f16b9395c82]
[TULRHST-HT140:23475] [ 5] /scratch/ge25yiz/Migration/source_precice/precice-2.2.0/lib/libprecice.so.2(_ZNK5Eigen15DenseCoeffsBaseINS_6MatrixIdLin1ELi1ELi0ELin1ELi1EEELi0EEclEl+0x48)[0x7f16a1f337ae]
[TULRHST-HT140:23475] [ 6] /scratch/ge25yiz/Migration/source_precice/precice-2.2.0/lib/libprecice.so.2(_ZN7precice7mapping22NearestNeighborMapping3mapEii+0xf2d)[0x7f16a217a0f5]
[TULRHST-HT140:23475] [ 7] /scratch/ge25yiz/Migration/source_precice/precice-2.2.0/lib/libprecice.so.2(_ZN7precice4impl19SolverInterfaceImpl11mapReadDataEv+0x54b)[0x7f16a232b447]
[TULRHST-HT140:23475] [ 8] /scratch/ge25yiz/Migration/source_precice/precice-2.2.0/lib/libprecice.so.2(_ZN7precice4impl19SolverInterfaceImpl10initializeEv+0x122d)[0x7f16a230423f]
[TULRHST-HT140:23475] [ 9] /scratch/ge25yiz/Migration/source_precice/precice-2.2.0/lib/libprecice.so.2(_ZN7precice15SolverInterface10initializeEv+0x20)[0x7f16a22dafc8]
[TULRHST-HT140:23475] [10] /home/HT/ge25yiz/migration/venv/lib/python2.7/site-packages/pyprecice-2.2.0.2-py2.7-linux-x86_64.egg/cyprecice.so(+0x1339d)[0x7f16a294339d]
[TULRHST-HT140:23475] [11] python(PyEval_EvalFrameEx+0x5520)[0x4b5210]
[TULRHST-HT140:23475] [12] python(PyEval_EvalCodeEx+0x7fc)[0x4b8e8c]
[TULRHST-HT140:23475] [13] python(PyEval_EvalCode+0x19)[0x4b8f99]
[TULRHST-HT140:23475] [14] python(PyRun_FileExFlags+0x131)[0x4e12c1]
[TULRHST-HT140:23475] [15] python(PyRun_SimpleFileExFlags+0xdf)[0x4e2d7f]
[TULRHST-HT140:23475] [16] python(Py_Main+0xbe0)[0x416080]
[TULRHST-HT140:23475] [17] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f16b9388830]
[TULRHST-HT140:23475] [18] python(_start+0x29)[0x4151e9]
[TULRHST-HT140:23475] *** End of error message ***
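
If I read the assertion correctly, it only complains that an element is accessed in an Eigen vector of size 0. Roughly this, in Python terms (just an analogy, nothing preCICE-specific):

import numpy as np

values = np.empty(0)  # an empty data vector, e.g. on a rank that owns no vertices
print(values[0])      # IndexError: index 0 is out of bounds for axis 0 with size 0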

Is this the intended behavior? Could this be the problem that makes my simulation hang?

Thank you again for helping me, guys. I appreciate it a lot.

Beini

I think I had misunderstood how your code works. It shouldn't matter how the number of vertices is defined; every rank is always ensured to have the number of vertices that was defined. Oddly enough, this only happens in the Python-Python coupling and not in the cpp-cpp coupling or the cpp-Python coupling.
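
Just to spell out my misunderstanding with a small sketch (made-up positions, not the actual code of the dummy): the vertex count is per rank, so every rank always owns a full, non-empty set of vertices, no matter how many ranks are used.

import numpy as np

def vertices_per_rank(rank, n_vertices, dims=3):
    # Every rank defines n_vertices of its own, shifted along x so that the
    # ranks do not overlap; no rank ever ends up with an empty mesh.
    vertices = np.zeros((n_vertices, dims))
    vertices[:, 0] = rank * n_vertices + np.arange(n_vertices)
    return vertices

# Even with n_vertices = 1 and two ranks, both ranks own one vertex:
print(vertices_per_rank(0, 1))  # [[0. 0. 0.]]
print(vertices_per_rank(1, 1))  # [[1. 0. 0.]]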

I am not sure if it matters, but did you also try Python-cpp? This would be the fourth option that I had not mentioned in the README.

Yes, I have tried it and it also worked. So to sum it up:

I have experimented a bit with the number of vertices and the number of ranks for the two solvers. The results are really strange:

A coupling that is not Python-Python will always work.

If the first solver has more ranks than the second solver, it always seems to work.
If the second solver has more ranks than the first solver, then I sometimes get the Eigen error.

For instance, defining the number of vertices = 1 and using a serial first solver with a parallel second solver always fails.

But for number of vertices = 2, oddly enough, using a serial first solver and a second solver with size = 3 does not fail. It does fail for a second solver with size = 4, though.

Do you have any idea what could cause this behavior?

Edit:

It seems I was wrong. The problem also occurs if the first solver uses the cpp adapter and the second solver uses the Python adapter. To sum it up: this problem can happen if the second solver uses the Python bindings and has more ranks than the first solver.

Hello everyone,

I think I have solved the problem. It seems to me that it is a problem in the preCICE source code. In the BaseCouplingScheme, the if-clause in line 78 should be deleted or changed; otherwise the coupling will not work when using gather-scatter communication if the master rank does not have any data points of some data sets. If we go into the source code of the GatherScatterCommunication, we can see (at line 106) that the master rank is required to enter this method for the data to be sent.
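
To illustrate the pattern I mean with a toy example (this is not the preCICE code, just a minimal mpi4py sketch of why skipping the send for locally empty data can break a gather-scatter exchange; send_to_other_participant is a made-up stand-in):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def send_to_other_participant(gathered):
    # made-up stand-in for the actual m2n send that the master rank performs
    print("master sends:", gathered)

# Pretend the master rank owns no vertices of this particular data set.
local_values = [] if rank == 0 else [float(rank)]

# Problematic pattern: skip the send when there is nothing local to send.
if len(local_values) > 0:
    # comm.gather is collective, so the master has to take part in it too.
    gathered = comm.gather(local_values, root=0)
    if rank == 0:
        send_to_other_participant(gathered)

# With several ranks, the master skips the guarded block, so the gather is
# never driven by the master and nothing is ever handed on to the other
# participant. In the coupled run, that other participant then waits forever --
# the same kind of hang I see between TAU and CAMRAD.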

Please correct me if I'm wrong about this, but my simulation completes the iteration loop on the TAU side and continues on the CAMRAD side if I take out this if-clause.

Again, thank you all for your help!

Beini


Great to hear that it seems to be solved! :slight_smile:

Maybe the developers could check whether your observations are correct. Could you open an issue on GitHub and report your findings? That would be greatly appreciated!

Edit: I just wanted to add that I can reproduce your behavior when enforcing gather-scatter communication. Running cpp-Python and Python-cpp with 4 ranks caused my Python solver to crash with what looks like the same error. cpp-cpp and Python-Python work, though. My assumption now would be that there might be something wrong with the MPI communicator when one mixes Python and cpp.

There is some output from other ranks in between, but the main error message should be clear:

python3: /usr/include/eigen3/Eigen/src/Core/DenseCoeffsBase.h:180: Eigen::DenseCoeffsBase<Derived, 0>::CoeffReturnType Eigen::DenseCoeffsBase<Derived, 0>::operator()(Eigen::Index) const [with Derived = Eigen::Matrix<double, -1, 1>; Eigen::DenseCoeffsBase<Derived, 0>::CoeffReturnType = const double&; Eigen::Index = long int]: Assertion `index >= 0 && index < size()' failed.
[lapsgs24:11937] *** Process received signal ***
[lapsgs24:11937] Signal: Aborted (6)
[lapsgs24:11937] Signal code:  (-6)
[lapsgs24:11937] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f0a1e9ed040]
[lapsgs24:11937] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f0a1e9ecfb7]
[lapsgs24:11937] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f0a1e9ee921]
[lapsgs24:11937] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x3048a)[0x7f0a1e9de48a]
[lapsgs24:11937] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30502)[0x7f0a1e9de502]
[lapsgs24:11937] [ 5] /home/jaustar/software/precice/2.2.0-debug-mpi-petsc-python/lib/libprecice.so.2(_ZNK5Eigen15DenseCoeffsBaseINS_6MatrixIdLin1ELi1ELi0ELin1ELi1EEELi0EEclEl+0x48)[0x7f09fb5ce2d4]
[lapsgs24:11937] [ 6] preCICE: Compute read mapping from mesh "MeshOne" to mesh "MeshTwo".
preCICE: Compute read mapping from mesh "MeshOne" to mesh "MeshTwo".
preCICE: Compute read mapping from mesh "MeshOne" to mesh "MeshTwo".
/home/jaustar/software/precice/2.2.0-debug-mpi-petsc-python/lib/libprecice.so.2(_ZN7precice7mapping22NearestNeighborMapping3mapEii+0xf0c)[0x7f09fb80be62]
[lapsgs24:11937] [ 7] preCICE: Mapping distance min:0 max:0 avg: 0 var: 0 cnt: 1
preCICE: Mapping distance min:0 max:0 avg: 0 var: 0 cnt: 1
preCICE: Mapping distance min:0 max:0 avg: 0 var: 0 cnt: 1
preCICE: iteration: 1 of 2, time-window: 1 of 2, time: 0, time-window-size: 1, max-timestep-length: 1, ongoing: yes, time-window-complete: no, write-iteration-checkpoint 
/home/jaustar/software/precice/2.2.0-debug-mpi-petsc-python/lib/libprecice.so.2(_ZN7precice4impl19SolverInterfaceImpl11mapReadDataEv+0x54b)[0x7f09fba63c0b]
[lapsgs24:11937] [ 8] preCICE: iteration: 1 of 2, time-window: 1 of 2, time: 0, time-window-size: 1, max-timestep-length: 1, ongoing: yes, time-window-complete: no, write-iteration-checkpoint 
preCICE: iteration: 1 of 2, time-window: 1 of 2, time: 0, time-window-size: 1, max-timestep-length: 1, ongoing: yes, time-window-complete: no, write-iteration-checkpoint 
DUMMY (1): Writing iteration checkpoint
DUMMY (0): Writing iteration checkpoint
/home/jaustar/software/precice/2.2.0-debug-mpi-petsc-python/lib/libprecice.so.2(_ZN7precice4impl19SolverInterfaceImpl10initializeEv+0x122f)[0x7f09fba3c93b]
[lapsgs24:11937] [ 9] DUMMY (0): Advancing in time
DUMMY (1): Advancing in time
/home/jaustar/software/precice/2.2.0-debug-mpi-petsc-python/lib/libprecice.so.2(_ZN7precice15SolverInterface10initializeEv+0x20)[0x7f09fba12710]
[lapsgs24:11937] [10] /home/jaustar/software/precice/2.2.0-debug-mpi-petsc-python/python/lib/python3.6/site-packages/cyprecice.cpython-36m-x86_64-linux-gnu.so(+0xf36d)[0x7f09fc07c36d]
[lapsgs24:11937] [11] python3[0x50a561]
[lapsgs24:11937] [12] python3(_PyEval_EvalFrameDefault+0x444)[0x50bf44]
[lapsgs24:11937] [13] python3[0x507cd4]
[lapsgs24:11937] [14] python3(PyEval_EvalCode+0x23)[0x50ae13]
[lapsgs24:11937] [15] python3[0x635262]
[lapsgs24:11937] [16] python3(PyRun_FileExFlags+0x97)[0x635317]
[lapsgs24:11937] [17] python3(PyRun_SimpleFileExFlags+0x17f)[0x638acf]
[lapsgs24:11937] [18] python3(Py_Main+0x591)[0x639671]
[lapsgs24:11937] [19] python3(main+0xe0)[0x4b0e40]
[lapsgs24:11937] [20] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f0a1e9cfbf7]
[lapsgs24:11937] [21] python3(_start+0x2a)[0x5b2f0a]
[lapsgs24:11937] *** End of error message ***
