I am using preCICE on a large HPC cluster (on the order of 10^3 nodes). The nodes have multiple network names, one of which may need to be set in the precice-config.xml file to enable communication across more than one node.
I searched some previous posts on this channel, and the suggested solution so far seems to be to look up the network name (e.g., with ip link) and enter it in the config file manually. I think this still works fine for a small cluster. A very large cluster like the one I am working with has multiple network names and uses a Slurm-like task manager, so it is almost impossible to know a priori which network name will be assigned to my job. This creates a huge bottleneck for my simulation setup with preCICE on the cluster. I wonder if there is any update on this aspect!
Is this a homogeneous cluster, or are you trying to use nodes with different configurations (such as different CPUs)? Are the different networks for different partitions? The typical use case is running on multiple nodes of the same architecture.
The main idea for the network attribute is that the default network is typically one only accessible within one node/island, while there is typically a larger common network connecting nodes.
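To illustrate where the attribute goes, here is a minimal sketch in Python that fills an interface name into a config line at job start. The participant names Fluid and Solid, the @NETWORK@ placeholder, and the interface name ib0 are made up for illustration, and the acceptor/connector attribute names assume preCICE v3 (v2 used from/to), so please adapt it to your own setup.

```python
# Sketch: fill the network attribute of an m2n:sockets tag from a template.
# "Fluid", "Solid", the @NETWORK@ placeholder and "ib0" are hypothetical.
# acceptor/connector assume preCICE v3; preCICE v2 used from/to instead.
TEMPLATE = '<m2n:sockets acceptor="Fluid" connector="Solid" network="@NETWORK@" />'


def render_m2n(template: str, network: str) -> str:
    """Return the template with the placeholder replaced by the interface name."""
    if "@NETWORK@" not in template:
        raise ValueError("template has no @NETWORK@ placeholder")
    return template.replace("@NETWORK@", network)


if __name__ == "__main__":
    # The interface name would come from the compute node, not be hard-coded.
    print(render_m2n(TEMPLATE, "ib0"))
```

The same substitution could be applied to a whole precice-config.xml template inside the job script, so that the file does not have to be edited by hand for every job.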
In any case, if the setup is different, it would be interesting to know more about it to find out how we could support it.
I wonder what further information I can provide to help you better understand my difficulty. For example, are there terminal commands that I could run and whose output I could post in this conversation? I roughly understand what your comments are about, but I would like to learn what information is needed.
Are you using a cluster with public documentation? The networks should be documented there, and a link could help.
Please forgive my question (I have no clue who I am talking to), but have you already talked about this with your cluster admin? The issue might primarily be system-specific, and then maybe something we could address.
There are some notes on ‘Network’, but they are about the speed of the network communication (OOO Gbs speed in communication, etc.). I am not sure exactly what information is needed from the documentation, nor exactly what I should ask the cluster admin.
The preCICE documentation says that users should provide the network name for the m2n socket communication when communicating across multiple nodes, but I have not yet explained this to the cluster admin or asked about it. This is mainly because I have run into a similar issue whenever I tried to use preCICE on a large-scale HPC cluster (not only this one), meaning that I had to find the proper network name by trial and error. This time, however, even the trial-and-error approach (switching the network name one by one until my Slurm job finally runs) does not seem to work, unfortunately.
If there is a question you would suggest I ask the cluster admin, I can do so, but at this point I am not sure what I need to ask.
This sounds like an interesting case, but I also expect that this should be solvable. Other parallel applications seem to work on the cluster.
It would be interesting to get some more information to get a better understanding of what is not working as intended.
Do you have a sample preCICE configuration and a SLURM script that do not work that you could share?
On the compute nodes, you could run ip link show or netstat -i to get an overview of the available network interfaces. Could you share the output of these commands?
What is the actual error that you run into? Does the simulation not start, or is it simply very slow?
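If it is easier to run from inside the job script, a rough Python equivalent could look like the sketch below. It only lists interface names via the standard library and drops the loopback interface; it assumes a Unix-like system where the loopback interface is called lo, and it does not tell you which interface is the right one for preCICE.

```python
from __future__ import annotations

import socket


def list_interfaces() -> list[str]:
    """Return the names of the network interfaces visible on this node."""
    # socket.if_nameindex() returns (index, name) pairs on Unix-like systems.
    return [name for _index, name in socket.if_nameindex()]


def non_loopback(names: list[str]) -> list[str]:
    """Drop the loopback interface, which is only reachable within one node."""
    # Assumption: the loopback interface is named "lo", as on typical Linux nodes.
    return [name for name in names if name != "lo"]


if __name__ == "__main__":
    # Print the host name as well, so outputs from different nodes can be compared.
    print(socket.gethostname(), non_loopback(list_interfaces()))
```

Running this on each allocated node (for example through the scheduler's launcher) would show whether all nodes of a job expose the same interface names.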
Could you share the preCICE log of the participants that have issues?