Problem in calling the adapted Calculix on HPC

Hi all. I am trying to run an FSI simulation on ARCHER2, the UK's national HPC system, as the first one to do so on this cluster. The technicians there have just helped me install the preCICE, OpenFOAM and CalculiX packages, including the adapters and the dependencies. When I run the fluid side with pimpleFoam, it starts as normal and waits for the coupling. However, when I try to start the solid part with the command ‘ccx_preCICE -i FOILTE -precice-participant Solid’, it shows the following error:

Setting up preCICE participant Solid, using config file: config.yml
terminate called after throwing an instance of ‘YAML::TypedBadConversion<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >’
what(): bad conversion

The test case runs well on my personal workstation, so I am wondering whether the config.yml file needs to be modified somehow when using a specific HPC system, or whether it is due to other reasons. Could anyone advise on this, please? Many thanks. The scripts and job outputs are all attached.

Best wishes,
Yabin
slurm-3831345.out (3.0 KB)
run_Calculix.txt (874 Bytes)
run_OpenFOAM.txt (1.0 KB)

slurm-3831361.out (1.6 KB)
Sorry, I forgot to attach the error output for the CalculiX side.

Hi @YabinLiu,

that’s interesting… Could you please also attach your config.yml?

If it works on one system but not on the other, then maybe the yaml-cpp version matters. In particular, make sure that you use the same version when compiling and when running.

But since it complains about a bad conversion, I would assume that this is related to the config.
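
For reference, the CalculiX adapter expects a config.yml with roughly the following structure (a sketch; the mesh, patch, and data names below are placeholders and have to match your precice-config.xml and CalculiX input deck):

    participants:
      Solid:
        interfaces:
        - nodes-mesh: Solid-Mesh            # coupling mesh name, must match precice-config.xml
          patch: interface                  # node set defining the coupling surface in the CalculiX input deck
          read-data: [Force]                # data read from preCICE
          write-data: [DisplacementDelta]   # data written to preCICE

    precice-config-file: ../precice-config.xml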

config.yml (229 Bytes)
precice-config.xml (2.4 KB)
Hi Makis,
thank you very much. I have attached the config.yml and precice-config.xml files. I have confirmed that these files are placed in the correct folders and that they are the same as the files I am using on my workstation. I think we used compatible versions of these packages when compiling on ARCHER2.

run_Calculix.txt (967 Bytes)
Sorry, the script for the CalculiX side should be this one.

Hi @YabinLiu,

independently (maybe) of the YAML-related issue you got, the XML file has a syntax error:

      <mapping:nearest-neighbor
        direction="write"
        from="Fluid-Mesh"
        to="Solid-Mesh"
        constraint="conservative" />
        timing="initial" />
      <mapping:nearest-neighbor
        direction="read"
        from="Solid-Mesh"
        to="Fluid-Mesh"
        constraint="consistent" />
        timing="initial" />

Do you see the duplicate /> ? But I don’t think you need the timing="initial" anyway. Other than that, the configuration looks good.
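
For reference, the same two mapping blocks without the stray /> lines and without the timing attribute would look like this:

      <mapping:nearest-neighbor
        direction="write"
        from="Fluid-Mesh"
        to="Solid-Mesh"
        constraint="conservative" />
      <mapping:nearest-neighbor
        direction="read"
        from="Solid-Mesh"
        to="Fluid-Mesh"
        constraint="consistent" />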

The YAML file looks fine to me.

Hi Makis,
Thanks for your check. I have found the reason for the previous YAML-related issue: the following commands needed to be added to the attached Slurm file:

export CPLUS_INCLUDE_PATH=$PRFX/yaml-cpp-yaml-cpp-0.6.2/include:${CPLUS_INCLUDE_PATH}
export LD_LIBRARY_PATH=$PRFX/yaml-cpp-yaml-cpp-0.6.2/build:${LD_LIBRARY_PATH}
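
For context, a minimal sketch of how these exports could sit in the CalculiX job script; the #SBATCH resources and the PRFX path are placeholders, not ARCHER2-specific values:

    #!/bin/bash
    #SBATCH --job-name=Solid
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --time=00:20:00

    # point the compiler and the runtime linker to the locally built yaml-cpp
    export PRFX=/path/to/your/installs    # placeholder prefix
    export CPLUS_INCLUDE_PATH=$PRFX/yaml-cpp-yaml-cpp-0.6.2/include:${CPLUS_INCLUDE_PATH}
    export LD_LIBRARY_PATH=$PRFX/yaml-cpp-yaml-cpp-0.6.2/build:${LD_LIBRARY_PATH}

    # start the solid participant (command from the first post)
    srun --ntasks=1 ccx_preCICE -i FOILTE -precice-participant Solid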

On a single node, the FSI coupling works well when using one processor for OpenFOAM and one or more processors for CalculiX. However, when I want to run OpenFOAM in parallel, the communication cannot be established and I get the attached error, which I suspect is caused by the domain decomposition of OpenFOAM. There is no such problem on my workstation, so I suppose it has something to do with the ARCHER2 architecture for data exchange.

If I use multiple nodes, the FSI coupling cannot start either, even if I use just one processor for OpenFOAM.

Do you have similar experiences on other HPC systems? Could you please advise on this?

Many thanks,
Yabin
2subjobs_preCICE_MultipleProcessors.out (7.6 KB)
2subjobs_preCICE_MultipleProcessors.slurm.txt (2.0 KB)

Do you have similar issues even if you use fewer ranks? Could this be related to memory usage? I don’t have any lead here…

In Appendix B of this report there are some hints about running preCICE on ARCHER2. Maybe this helps.

In any case, it would be great if you could document your findings in a new page under this documentation section: Special systems | preCICE - The Coupling Library

Hi Makis,
This always happens when running OpenFOAM in parallel, even with just two processors, as long as the OpenFOAM case needs to be decomposed. I don’t believe this is a memory issue, as the test cases are very simple 2D ones, and they even run well on a personal laptop. It seems to be related to how OpenFOAM exchanges data with CalculiX through preCICE on the cluster.

I have been interacting with an ARCHER2 technician. I will update our experiences under the documentation section once we have solved this final problem.

Cheers,
Yabin

Hi @YabinLiu,

I think you are not really running into a problem with the parallel solvers.

OpenFOAM states at startup:

Date : Jul 13 2023
Time : 17:39:53

Slurm aborts the job at 17:41:43 after waiting up to 32 seconds for the job step to finish, so the cancellation was triggered at about 17:41:11:

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 3919539.1 ON nid001085 CANCELLED AT 2023-07-13T17:41:43 ***
slurmstepd: error: *** STEP 3919539.0 ON nid001085 CANCELLED AT 2023-07-13T17:41:43 ***

This means that your case runs for only about one minute before Slurm decides to cancel it.

Running solvers in parallel adds a significant workload in the initialization phase.
The biggest chunk is the vertex ownership deduction in the re-partitioning phase, which starts after the mapping is computed.
The last log is about the mapping, so this lines up.

Maybe bump the max runtime to 5 minutes or so and you should see more output.
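
For example, assuming the limit is set via an #SBATCH directive in your job script:

    #SBATCH --time=00:05:00    # give the initialization enough time to finish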

Hi @fsimonis, thanks for the kind suggestions. I also suspected this previously, and I have let the job run for more than 20 minutes, but the output always stops at the line ‘Mapping distance not available due to empty partition.’

Moreover, I also tested using two processors for OpenFOAM, and the FSI still cannot run. Therefore, I don’t believe this is caused by the workload of the initialization phase. It should still be related to a communication problem when I decompose the OpenFOAM case to run it in parallel.

Cheers, Yabin

Hi,

In that case, the next step would be to enable debug output (a Debug build, or a Release build with PRECICE_RELEASE_WITH_DEBUG_LOG=YES).
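
For example, when rebuilding preCICE with CMake, this would be roughly (build and install paths omitted):

    cmake -DCMAKE_BUILD_TYPE=Release -DPRECICE_RELEASE_WITH_DEBUG_LOG=YES ..
    make -j 4 && make install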

Then enable debug logs in your configuration file and you should be able to see in detail where each solver hangs. You may want to add the name of the solver and the rank to the log output. Check out our docs for logging examples.
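
For instance, a debug log sink in the precice-config.xml (placed inside precice-configuration) could look roughly like this; this is a sketch based on the preCICE logging documentation, and the format and filter strings may need adjusting for your preCICE version:

      <log>
        <!-- tag every message with the participant name and MPI rank, and let debug messages through -->
        <sink type="stream"
              output="stdout"
              format="(%Participant%:%Rank%) %Severity% %Message%"
              filter="%Severity% >= debug"
              enabled="true" />
      </log>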

Cheers,
Frédéric