Re-read of prng state sometimes fails on loewe (Defect #716)
Occasionally a saved PRNG state could not be reused, even though the same hardware was used.
This happens only rarely, but the programme currently gives no further information about why the state is not valid.
Matthias, do you know how this use case could be improved? Validity should depend only on the number of processors, shouldn't it? Perhaps one could save some meta-information with the seeds.
Besides adding some more logging, I don't see what can be done to solve the issue. The number of random states is already stored in the file, and I don't know what other meta-information could help.
I am not even sure what can go wrong. Do you have the output of such a run? What exactly is the error shown? Is there a pattern to the failures, e.g. do they always happen on the same machines? I fear there might be some bad GPUs in that system. In that case it might be better to exclude those systems in the SLURM scheduling requests.
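To make the "more logging" idea concrete, here is a minimal sketch of a check that reports why a saved state is rejected instead of only failing. It is not the programme's actual code or file format: `SavedPrngState`, its fields, and `validate` are hypothetical names, and a fixed per-state size is assumed purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SavedPrngState:
    """Hypothetical in-memory view of a saved PRNG state file."""
    num_states: int   # number of random states recorded in the file
    state_size: int   # bytes per state (assumed fixed for this sketch)
    payload: bytes    # raw state data read from the file


def validate(saved: SavedPrngState, expected_states: int) -> List[str]:
    """Return the reasons the state cannot be reused (empty list if it can)."""
    problems = []
    # The stored count must match what the current device setup needs.
    if saved.num_states != expected_states:
        problems.append(
            f"file holds {saved.num_states} states, "
            f"device needs {expected_states}"
        )
    # The payload length must agree with the stored count.
    expected_bytes = saved.num_states * saved.state_size
    if len(saved.payload) != expected_bytes:
        problems.append(
            f"payload is {len(saved.payload)} bytes, "
            f"header implies {expected_bytes}"
        )
    return problems
```

Logging each returned string before aborting would at least show whether the failures come from a count mismatch or from a truncated file.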
- Status changed from New to Feedback
- Assignee changed from Matthias Bach to Christopher Pinke