I was running an FSI simulation coupling OpenFOAM and an in-house solver via preCICE when a segmentation fault suddenly occurred (after running for several hours):
forrtl: severe (174)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
FSSICAS20220322 000000000054C3AA for__signal_handl Unknown Unknown
libpthread-2.31.s 000014798E5543C0 Unknown Unknown Unknown
libprecice.so.2.3 000014798E8A4B12 _ZN7precice3com15 Unknown Unknown
libprecice.so.2.3 000014798E8A5553 Unknown Unknown Unknown
libprecice.so.2.3 000014798E8A66C2 Unknown Unknown Unknown
libprecice.so.2.3 000014798E89AA4D _ZN5boost4asio6de Unknown Unknown
libprecice.so.2.3 000014798E8840A8 Unknown Unknown Unknown
libstdc++.so.6.0. 000014798948BDE4 Unknown Unknown Unknown
libpthread-2.31.s 000014798E548609 Unknown Unknown Unknown
libc-2.31.so 000014798E46D293 clone Unknown Unknown
I searched for this problem in related topics (most of them concern a memory leak in the CalculiX adapter) and found that the key points were stack size and memory leaks.
To get rid of the stack size limit, I usually run ulimit -s unlimited before starting the simulation.
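To double-check that the raised limit is really in effect for a process started from that shell, something like the following can be run from the same shell. It is only a minimal sketch; the helper names `format_limit` and `stack_limits` are made up for illustration, and it only reports the limit of the Python process itself, which it inherits from the shell.

```python
import resource


def format_limit(value: int) -> str:
    """Render an rlimit value (in bytes) the way `ulimit -s` would, in KiB."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def stack_limits() -> tuple:
    """Return the (soft, hard) stack size limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print(f"stack soft limit: {soft}, hard limit: {hard}")
```

If the soft limit does not print as unlimited here, the ulimit call probably did not apply to the shell that launches the solvers.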
As for memory leaks, I used gnome-system-monitor to see whether the memory kept growing. However, the memory used by our in-house solver was 2.3 GiB and it didn’t increase during the simulation. The total memory of our server is 503.4 GiB, while the memory used during the simulation is around 23 GiB. So a memory leak is probably not the reason why the segmentation fault occurred.
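For a run that lasts hours, watching a GUI monitor is easy to miss, so the same check can be logged to the terminal instead. Below is a minimal sketch, assuming a Linux system where /proc/<pid>/status contains a VmRSS line; the function names `parse_vmrss_kib` and `log_rss` are hypothetical, and the PID has to be looked up separately.

```python
import time
from pathlib import Path
from typing import Optional


def parse_vmrss_kib(status_text: str) -> Optional[int]:
    """Return the VmRSS value (in KiB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like: "VmRSS:   2411724 kB"
            return int(line.split()[1])
    return None


def log_rss(pid: int, interval_s: float = 60.0, samples: int = 10) -> list:
    """Sample the resident memory of a process; return (timestamp, KiB) pairs."""
    history = []
    for _ in range(samples):
        try:
            text = Path(f"/proc/{pid}/status").read_text()
        except OSError:
            break  # the process has exited (or /proc is not available)
        rss = parse_vmrss_kib(text)
        if rss is not None:
            history.append((time.time(), rss))
            print(f"{time.strftime('%H:%M:%S')} pid={pid} rss={rss / 1024**2:.2f} GiB")
        time.sleep(interval_s)
    return history
```

Calling `log_rss(pid, interval_s=60, samples=600)` for the solver's PID would give roughly ten hours of samples; a steadily increasing last column would hint at a leak, while a flat one matches what I saw in gnome-system-monitor.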
I also tried to use valgrind to check the memory, but I don’t really understand the messages it produced…
valgrind.log (1.1 MB)
Moreover, both the fluid part (OpenFOAM) and the solid part (in-house solver) converged well.
The log of OpenFOAM is too large to upload. The log of solid part and config file are uploaded.
FSSICAS.log (1.4 MB)
precice-config.xml (2.6 KB)
What else may lead to this problem? Any hints will be appreciated!
My colleague and I are currently identifying whether the problem is caused by our in-house solver or preCICE. The error message is shown in the following picture:
If the segfault is caused by preCICE, there will be some prefix like ---[precice] ERROR, right?
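One way to check this over a large log is to search it for such lines. The sketch below assumes the prefix looks like the `---[precice]` seen elsewhere in this thread, followed somewhere on the line by the word ERROR; the function name `precice_error_lines` is made up. Note that a crash from a signal such as SIGSEGV may well not go through the preCICE logger at all, so an empty result does not by itself rule preCICE out.

```python
import re

# Colour escape sequences that can end up in a log captured from a terminal.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
PRECICE_PREFIX = "---[precice]"  # assumed prefix, taken from this thread


def precice_error_lines(log_text: str) -> list:
    """Return (line_number, text) for log lines that look like preCICE errors."""
    hits = []
    for number, raw in enumerate(log_text.splitlines(), start=1):
        line = ANSI_ESCAPE.sub("", raw)
        if PRECICE_PREFIX in line and "ERROR" in line:
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_precice_errors.py FSSICAS.log
    with open(sys.argv[1], errors="replace") as handle:
        for number, text in precice_error_lines(handle.read()):
            print(f"{number}: {text}")
```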
It is very difficult to understand what is going on here because so many parts of the trace are marked as “Unknown”. It looks like the segfault is triggered inside preCICE and is related to the communication (I see precice com and libpthread in the trace), but this could very well be caused by external factors. I guess we could only get more information by rebuilding preCICE in debug mode, but I also guess you don’t want to repeat hours of simulation. Since it appears only after some run time, it could also be a communication buffer access issue (maybe @fsimonis has an idea here).
Looking at valgrind.log, I get the impression that the simulation does not even get past initialization, which confuses me. Does it correspond to the same problem/situation? This pattern appears near the end (probably written by every process):
==28628== 2,752 bytes in 1 blocks are definitely lost in loss record 54 of 54
==28628== at 0xE2507F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==28628== by 0x1CE89702: opal_free_list_grow_st (in /usr/lib/x86_64-linux-gnu/openmpi/lib/libopen-pal.so.40.20.3)
==28628== by 0x3C3D0CC4: ???
==28628== by 0x1CED79C8: mca_btl_base_select (in /usr/lib/x86_64-linux-gnu/openmpi/lib/libopen-pal.so.40.20.3)
==28628== by 0x3C396527: ???
==28628== by 0x1B2AD70A: mca_bml_base_init (in /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so.40.20.3)
==28628== by 0x1B2ED714: ompi_mpi_init (in /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so.40.20.3)
==28628== by 0x1B2910B0: PMPI_Init (in /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so.40.20.3)
==28628== by 0x175E2D2D: precice::utils::Parallel::initializeMPI(int*, char***) (in /usr/lib/x86_64-linux-gnu/libprecice.so.2.3.0)
==28628== by 0x175E2DCD: precice::utils::Parallel::initializeManagedMPI(int*, char***) (in /usr/lib/x86_64-linux-gnu/libprecice.so.2.3.0)
==28628== by 0x175AEA08: precice::impl::SolverInterfaceImpl::configure(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /usr/lib/x86_64-linux-gnu/libprecice.so.2.3.0)
==28628== by 0x175AF79F: precice::impl::SolverInterfaceImpl::SolverInterfaceImpl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int, void*) (in /usr/lib/x86_64-linux-gnu/libprecice.so.2.3.0)
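To get an overview of a 1.1 MB log instead of reading records one by one, the loss records can be summed per process. This is only a sketch based on the record format quoted above ("N bytes in M blocks are definitely lost in loss record ..."); the name `summarize_losses` is hypothetical, and records in other categories (e.g. still reachable) are ignored.

```python
import re
from collections import defaultdict

# Matches e.g. "==28628== 2,752 bytes in 1 blocks are definitely lost in loss record 54 of 54"
# and the variant with "(x direct, y indirect)" after the byte count.
LOSS_RE = re.compile(
    r"^==(?P<pid>\d+)==\s+(?P<bytes>[\d,]+)(?: \([^)]*\))? bytes in "
    r"(?P<blocks>[\d,]+) blocks are (?P<kind>definitely|indirectly|possibly) lost"
)


def summarize_losses(log_text: str) -> dict:
    """Sum the lost bytes per (pid, kind) over all loss records in a memcheck log."""
    totals = defaultdict(int)
    for line in log_text.splitlines():
        match = LOSS_RE.match(line)
        if match:
            key = (int(match["pid"]), match["kind"])
            totals[key] += int(match["bytes"].replace(",", ""))
    return dict(totals)


if __name__ == "__main__":
    import sys

    # Usage: python summarize_valgrind.py valgrind.log
    with open(sys.argv[1], errors="replace") as handle:
        for (pid, kind), lost in sorted(summarize_losses(handle.read()).items()):
            print(f"pid {pid}: {lost} bytes {kind} lost")
```

Small, constant totals that all originate below PMPI_Init, as in the record above, would suggest allocations made once during MPI start-up rather than a leak that grows during the coupling.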
One side comment, looking at your config file: why are you exchanging velocity on the face nodes? Is this really the velocity of the solid motion? Is it the velocity from the FF module, or did you extend the FSI module of the adapter?
I also noticed in FSSICAS.log the line FSSI-CAS-2D_Soil Model For Windows OS. Is this really running on Windows?
Thanks for your reply!
When we compile our solver on Linux with Intel Fortran, no errors or warnings occur. Besides, we’ve checked both the fluid and solid results at the time the segfault occurred, and both were fine. I also don’t know why the simulation stops immediately when run under valgrind. So at present, the best way to figure out this problem is to rebuild preCICE in debug mode and run the simulation again to get more information, right?
Firstly, as we are simulating the interaction between waves, soil, and structure, there is a seepage velocity at the interface, calculated by our solid solver, which needs to be passed to the fluid domain. The seepage velocity is calculated from the pressure obtained from OpenFOAM, and it is stored on cell nodes (variables are stored on cell nodes in our FEM solver).
Yes, I extended the FSI module of the adapter (an older version) according to this topic. Right now, we pass Pressure from fluid to solid on face centers, and we also pass Velocity from solid to fluid on face nodes. The modified adapter is uploaded here. Could you please point out anything that is wrong in my modification?
openfoam-adapter-OpenFOAM8.zip (1.5 MB)
Since we are still developing the two-way coupling for our solver on Linux, we haven’t changed the name of our solver yet. It is now running on Ubuntu MATE 20.04.
Since our server has some networking issues, I didn’t succeed in recompiling preCICE from source in debug mode there. Therefore, I recompiled preCICE in debug mode on my PC and reran the case to collect information. The following picture now shows Configuration: Debug. Does this mean that the preCICE library is linked correctly and running in debug mode? (The terminal doesn’t output more messages than before.) Is anything else needed to collect more information when the error happens?
It is compiled in Debug mode, as you can also see from the line ---[precice] Configuration: Debug. You have also enabled more output in the precice-config.xml (I believe), since you now receive more information.
It does not look to me like the initialization has been completed, but maybe it already proceeded in the meantime.
Firstly, thanks for your help!
In previous tests on our server (Kunlun 9016), we found that the time at which the segfault occurs is random. For exactly the same case, segfault 174 happened at different times (please see the following pictures).
My colleague and I have checked our solver carefully over the past few days. Our solver has an interface for one-way coupling on both Windows and Linux, and its stability has been tested on many cases. When we run our solver in one-way coupling with OpenFOAM, no segfault happens on either Windows or Linux. Therefore, we assume the problem lies in the data exchange.
So, I downloaded the latest version of the OpenFOAM adapter and modified it to meet our needs. I rebuilt preCICE in debug mode on another small server and ran a simulation with a smaller fluid mesh. The simulation has been running for days and the segfault has not happened yet. If the segfault happens again, I’ll post the information here to discuss.
It’s quite exciting to hear that this problem was fixed!
After the previous case finished, the same segfault happened again in another simulation (also running in debug mode).
I’ll modify the source code as suggested in the topic, recompile preCICE, and check whether the problem remains.