I have seen similar behavior before. The main problem is that the residual of a single iteration, viewed as an “observation function”, is very badly conditioned, and therefore so is the number of iterations. Small changes anywhere can lead to big changes here.
And you already get such small changes when running your solver with a different number of threads. I guess you could also reproduce this behavior on the same system with the same software stack, just with a different number of parallel threads (as soon as you use any collective operations in your solver). Can you?
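To illustrate why the thread count alone is enough: floating-point addition is not associative, so a reduction that splits the data into a different number of partial sums can round differently. Here is a minimal Python sketch of that effect. It is not your solver; `naive_sum` and `chunked_sum` are hypothetical helpers that only mimic how a collective sum might combine per-thread partial results, and the input values are deliberately extreme so the difference is easy to see.

```python
import math


def naive_sum(values):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Mimic a parallel reduction: one partial sum per chunk, then combine.

    Hypothetical stand-in for a collective operation; a real runtime may
    combine the partial results in yet another order.
    """
    size = math.ceil(len(values) / n_chunks)
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


# Same numbers, same machine; only the number of "threads" differs.
values = [1e16, 1.0, -1e16, 1.0]

for n in (1, 2):
    print(f"{n} chunk(s): {chunked_sum(values, n)!r}")
print(f"correctly rounded sum: {math.fsum(values)!r}")
```

In a real solver the differences are of course far smaller (last bits of a dot product or norm), but with a badly conditioned quantity like the iteration count that can be enough to tip a convergence check one iteration earlier or later.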
Honestly, I would recommend not looking too deeply into that hole. It drives you crazy.
The good thing, however, is that the average number of iterations is largely independent of this, given a sufficient number of timesteps, especially if you neglect the initial X timesteps.
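If you want to compare runs on that basis, something like the following sketch would do. The function name `mean_iterations` and the `skip` parameter are my own, and the iteration counts in the usage lines are made-up illustrative numbers, not measurements from any real run.

```python
def mean_iterations(iterations_per_step, skip=0):
    """Average iteration count per timestep, ignoring the first `skip` timesteps."""
    kept = iterations_per_step[skip:]
    if not kept:
        raise ValueError("no timesteps left after skipping the initial ones")
    return sum(kept) / len(kept)


# Made-up iteration counts for two runs of the same case (illustration only).
run_a = [12, 9, 5, 4, 5, 4]
run_b = [11, 10, 4, 5, 4, 5]

# The individual timesteps differ, but the averages after the start-up phase can be compared.
print(mean_iterations(run_a, skip=2))
print(mean_iterations(run_b, skip=2))
```

How many initial timesteps to skip depends on your case; I would pick the value by looking at when the iteration count settles.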