Case runs normally on the local machine but gets stuck on the cluster during the preCICE initialization phase

Dear all,

I am simulating elastic structures entering water. There is a similar thread here, but I am still very confused about this issue: https://precice.discourse.group/t/running-the-case-on-the-cluster-without-errors-but-hanging-for-a-long-time-in-the-first-step/2150

The case runs smoothly on my local machine, but when I submit it on the cluster it gets stuck in the preCICE initialization phase for a very long time. I am using OpenFOAM v1912, CalculiX 2.20, and preCICE 2.3.0. The software configuration is identical on both machines; the only differences are that the local machine runs Ubuntu 20.04 while the cluster runs CentOS 7, and that on the cluster I submit the job with Slurm.
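For reference, the job layout is roughly the following (a minimal sketch only; the actual scripts are in the attached slurm.txt, runFluid.txt, and runSolid.txt, and the job name and directory names here are placeholders):

#!/bin/bash
#SBATCH --job-name=fsi-case   # placeholder name
#SBATCH --nodes=1             # a single 128-core node, so no cross-node communication
#SBATCH --ntasks=128

# start both solvers on the same node and wait for them to finish
(cd fluid-openfoam && ./runFluid > FluidLog.txt 2>&1) &
(cd solid-calculix && ./runSolid > SolidLog.txt 2>&1) &
wait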

Note that I have already run some small test cases on the cluster and they start simulating normally, so the cluster's software configuration should be fine. One node of the cluster has 128 cores and I only use a single node at a time, so I do not need to deal with cross-node communication or a double allocation of MPI ranks, and I do not use a hostfile to allocate compute resources. In addition, since my case does not involve cross-node communication, I still use the lo loopback interface in the network settings of the preCICE configuration. If anything I am doing here is wrong, please point it out.
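The m2n line in my precice-config.xml currently looks roughly like this (participant names and the exchange directory are simplified here; the point is only the network attribute):

<m2n:sockets from="Fluid" to="Solid" exchange-directory=".." network="lo" />
<!-- network="lo" restricts the socket connection to the loopback interface -->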

The log files show that CalculiX establishes communication very quickly, but OpenFOAM is always stuck at

---[precice] Compute "write" mapping from mesh "Fluid-Mesh-Centers" to mesh "Solid-Mesh".

The coupling interface meshes in this case are quite large, on the order of tens of thousands of vertices. But preCICE should be able to handle massively parallel cases, shouldn't it? I am using an RBF mapping with a support radius.
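The write mapping in question is configured roughly as follows (the concrete RBF basis function and the constraint are simplified for illustration; the support radius in my actual config is 0.05):

<mapping:rbf-compact-tps-c2 direction="write"
                            from="Fluid-Mesh-Centers" to="Solid-Mesh"
                            constraint="conservative" support-radius="0.05" />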

Thank you for reading my wall of text. The relevant files are attached below, and any thoughts or comments are appreciated. :wink:
runFluid.txt (258 Bytes)
runSolid.txt (100 Bytes)
slurm.txt (267 Bytes)
FluidLog.txt (3.8 KB)
SolidLog.txt (7.1 KB)

If I change the mapping to nearest-neighbor, the case also starts very quickly on the cluster. So I think the hang is caused by the large number of coupling mesh vertices, but I am puzzled as to why I do not see this problem locally.
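Concretely, the only change for that test was swapping the mapping tag, roughly (again with the constraint simplified):

<mapping:nearest-neighbor direction="write"
                          from="Fluid-Mesh-Centers" to="Solid-Mesh"
                          constraint="conservative" />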

I would start with:

  1. Use a different network interface anyway. Presumably InfiniBand is available, since you have a cluster.
  2. Double-check that the solver executables on the cluster link to the libraries you expect (see the commands sketched after this list).
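For example, something along these lines on a compute node (the binary and library names are only the typical ones from the CalculiX and OpenFOAM adapters; adjust the paths to your installation):

# list the network interfaces available on the node (look for an InfiniBand one such as ib0)
ip addr

# check which libprecice (and which MPI/PETSc) the solvers actually pick up
ldd $(which ccx_preCICE) | grep -iE 'precice|petsc|mpi'
ldd $FOAM_USER_LIBBIN/libpreciceAdapterFunctionObject.so | grep -iE 'precice|petsc|mpi'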

Thanks very much for the reminder.
While checking the installation, I realized that I may have accidentally turned off a feature of preCICE, because I used the following command when building the release version of preCICE:

cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=~/software/precice -DPRECICE_PETScMapping=OFF -DPRECICE_PythonActions=OFF …

This turns off the PETSc-based RBF mappings, i.e. the MPI-parallel RBF implementation. As a result the RBF mapping runs serially and appears to hang when the coupling meshes are this large. I will remove the extra options and rebuild preCICE, and I think things will get better. :grinning:
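After rebuilding, I can check whether the PETSc-based mapping is really enabled, for example by looking at what libprecice links against (the path follows from the install prefix above; on some systems the library ends up in lib64 instead of lib):

ldd ~/software/precice/lib/libprecice.so | grep -iE 'petsc|mpi'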

Dear all,
I recompiled preCICE and configured it with

cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=~/software/precice …

I have also set network="ib0" in precice-config.xml. But when I run the case, CalculiX runs fine while OpenFOAM still gets stuck at

---[precice] Using tree-based preallocation for matrix C
---[precice] Using tree-based preallocation for matrix A

I am using the PETSc-based RBF mapping with a support radius of 0.05. With nearest-neighbor mapping the simulation starts right away, but it tends to diverge.

I am completely lost. My coupling interface has tens of thousands of nodes, yet the case runs without problems on my local Ubuntu 20.04 machine. Can anyone give me some advice? I would greatly appreciate it.

Dear all,
I used ctest to check my preCICE installation and some of the tests failed. Could this affect the initialization speed?

83% tests passed, 5 tests failed out of 29
Label Time Summary:
Solverdummy = 4.01 sec
mpiports = 40.12 sec
Total Test time (real) = 192.56 sec
The following tests FAILED:
1 - precice.acceleration (Timeout)
4 - precice.com.mpiports (Timeout)
8 - precice.m2n.mpiports (Timeout)
15 - precice.serial (Timeout)
16 - precice.parallel (Timeout)
Errors while running CTest
I do not have much experience with preCICE, especially when running on a cluster. Please help me. :upside_down_face:
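One thing I can still try is re-running only the failed tests with their output shown (run from the preCICE build directory; the regex simply matches the five test names listed above):

ctest --output-on-failure -R 'mpiports|acceleration|serial|parallel'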