HMC test against tmlqcd (Benchmark #271)
As a first performance milestone, the HMC should be tested against tmlqcd.
24^4 is the largest lattice on which the dslash performs well; therefore, this size should be tested.
As a first test, we should use an unimproved gauge action and a heavy mass, starting from a cold configuration (this is similar to a 4^4 sample input from tmlqcd):
Measurements = 10
StartCondition = Cold
2KappaMu = .177
kappa = .177
NSave = 500000
ThetaT = 1.
BCGstabMaxIter = 10000
CGMaxIter = 10000
UseEvenOdd = yes
ReversibilityCheck = no
ReversibilityCheckIntervall = 100
InitialStoreCounter = 0
DebugLevel = 0
Type = wilson
beta = 6.
Timescale = 0
Timescale = 1
2KappaMu = .177
kappa = .177
AcceptancePrecision = 1.e-23
ForcePrecision = 1.e-12
Name = det
#solver = BiCGStab
solver = CG
Type0 = 2MN
Type1 = 2MN
IntegrationSteps0 = 5
IntegrationSteps1 = 8
tau = 0.5
NumberOfTimescales = 2
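For reference, the step sizes implied by the integrator settings above can be sketched. This assumes the common nested-integrator convention in which timescale 1 is the outer scale with step size tau / IntegrationSteps1 and timescale 0 subdivides each outer step; the timescale ordering is an assumption, not taken from the tmlqcd source:

```python
# Sketch of nested-integrator step sizes for the input above.
# Assumption: timescale 1 is the outer scale, timescale 0 subdivides it.
tau = 0.5
n1 = 8   # IntegrationSteps1
n0 = 5   # IntegrationSteps0

eps1 = tau / n1   # outer step size: 0.0625
eps0 = eps1 / n0  # inner step size: 0.0125

print(f"outer step: {eps1}, inner step: {eps0}")
```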
The 10 HMC trajectories are useful in order not to deal with the same random numbers again and again. However, the program should be run several times for statistics, and one should also watch how the acceptance rate behaves; perhaps the parameters have to be modified (this was just a wild guess).
In addition, the runs should be performed on the LOEWE on one node, i.e. 24 cores.
MPI can be used in up to 4 (lattice) dimensions. I would suggest using 2 dimensions (t and x) with a (6,4) process grid, i.e. a local 4 * 6 * 24 * 24 lattice.
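The decomposition above can be sketched as follows; `local_extents` and the (t, x, y, z) ordering are illustrative, not taken from the codebase:

```python
# Sketch: local lattice extents from a global 24^4 lattice and an
# MPI decomposition over (t, x), as suggested above.
def local_extents(global_extents, proc_grid):
    assert all(g % p == 0 for g, p in zip(global_extents, proc_grid)), \
        "each extent must be divisible by the process grid"
    return tuple(g // p for g, p in zip(global_extents, proc_grid))

glob = (24, 24, 24, 24)  # (t, x, y, z)
grid = (6, 4, 1, 1)      # 6 * 4 = 24 ranks = one LOEWE node

print(local_extents(glob, grid))  # (4, 6, 24, 24)
```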
If MPI is used, the times for each HMC trajectory are printed to a file at the end.
I performed some testing of the script and also ran a couple of tmlqcd runs on the LOEWE (see attached input and slurm file).
In addition to the cold starts with 10 HMC steps, I also performed one step from a given configuration.
I do not attach the configuration for the reread runs here since it is 183 MB.
The total runtimes were (statistics over ~10 runs):
- cold start: 739 s
- reread: 78 s
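Assuming the cold start comprises the 10 trajectories set by Measurements and the reread run a single one (the "one step from a given config" above), the per-trajectory times are consistent between the two runs:

```python
# Sketch: per-trajectory times from the totals above.
# Assumption: cold start = 10 trajectories, reread = 1 trajectory.
cold_total, cold_traj = 739, 10
reread_total, reread_traj = 78, 1

cold_per_traj = cold_total / cold_traj        # 73.9 s
reread_per_traj = reread_total / reread_traj  # 78.0 s

print(f"cold: {cold_per_traj} s/traj, reread: {reread_per_traj} s/traj")
```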
There were some problems when I tried to reproduce the cold-start results with OPTIMAL: with the same parameters, OPTIMAL does not arrive at the same configuration in the end, although the HMC tests themselves pass. I think this will take some time to sort out.
However, runs starting from the given configuration should be comparable even without identical results, because the same kinds of steps are performed! At least for a first comparison (I attached a suitable input file).
This configuration currently does not run on the GPUs in LOEWE. The problem is probably caused by the memory requirements. A test on an HD7970 shows the following:
[11:49:37] TRACE: Memory usage (Tahiti): 254803976 bytes - Maximum usage: 1656579312 - Host backed memory: 0
This is about 1.5 GiB, more than the HD5870 provides.
24^3 * 8 successfully completes on the HD5870. Memory requirements are:
[11:23:27] TRACE: Memory usage (Cypress): 84934664 bytes - Maximum usage: 552425712 - Host backed memory: 0
24^3 * 10 successfully completes on the HD5870. Memory requirements are:
[11:44:08] TRACE: Memory usage (Cypress): 106168328 bytes - Maximum usage: 690272496 - Host backed memory: 0
24^3 * 12 fails:
[11:50:03] FATAL: OpenCL failed. Error code -5 in clEnqueueCopyBuffer at /home/compeng/bach/QCD/clhmc/prog/opencl_module.cpp:393
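Error code -5 is CL_OUT_OF_RESOURCES. A linear extrapolation in the temporal extent from the two successful runs suggests the 24^3 * 12 run should need well under 1 GiB in total, so the failure might come from a per-buffer allocation limit rather than total device memory; that is speculation, but the extrapolation itself is a minimal sketch of the numbers above:

```python
# Sketch: linear extrapolation of "Maximum usage" in the temporal
# extent T, with the spatial volume fixed at 24^3.
measured = {8: 552_425_712, 10: 690_272_496}  # T -> max usage in bytes

t0, t1 = sorted(measured)
slope = (measured[t1] - measured[t0]) / (t1 - t0)  # bytes per time slice

t = 12
estimate = measured[t1] + slope * (t - t1)
print(f"T={t}: ~{estimate / 2**30:.2f} GiB")  # ~0.77 GiB
```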
Performance results for the HD7970 sadly cannot be given either, as the GPU hangs if the hmc is built in release mode.
- Status changed from New to In Progress
The issue on the HD7970 seems to be specific to that GPU. On the V7800 the code completes with the following performance numbers:
[15:45:27] INFO: ## Program Parts:  total     perc
[15:45:27] INFO: ## Total:          65662291
[15:45:27] INFO: ## Init.:          4386644    6.7
[15:45:27] INFO: ## Perf.:          61274289  93.3
[15:45:27] INFO: Maximum memory used (Cypress): 1592874736 bytes
I performed additional runs with tmlqcd, extending both test runs to 2 and 3 nodes (48 and 72 cores, respectively).
I think it is interesting to compare these to our runtimes as well. Perhaps a different parallelization might speed up tmlqcd here; I have not tried that yet!
The results are:
| nodes | cold start | reread |
| 1 | 739 s | 78 s |
| 2 | 411 s | 45 s |
| 3 | 299 s | 33 s |
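From these numbers one can read off the strong-scaling behaviour of the tmlqcd runs; a minimal sketch of the speedup and parallel efficiency relative to one node:

```python
# Sketch: strong-scaling speedup and efficiency from the table above
# (cold-start runtimes; 24 cores per node).
runtimes = {1: 739, 2: 411, 3: 299}  # nodes -> cold-start runtime in s

for nodes, t in sorted(runtimes.items()):
    speedup = runtimes[1] / t
    print(f"{nodes} node(s): speedup {speedup:.2f}, "
          f"efficiency {speedup / nodes:.0%}")
```

The efficiency drops to roughly 80% on three nodes, so the comparison against our single-GPU runtimes should state the node count explicitly.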
There were some mistakes in the input files quoted here. The most important difference is that I used 8 integration steps instead of 4 on timescale 1 (I corrected that above, too). In addition, the integrator type was not specified in the OPTIMAL input file, so the default "leapfrog" was used, whereas tmlqcd uses "2mn".
With this corrected, one has to expect an increase in runtime!
On the other hand, the difference in the integrator was also responsible for the difference in results I observed! It vanishes once the same integrator is used (see attached log file).
Note: up to this point, an unimproved gauge action is used. Changing to an improved one, I observed a difference in dH, while the resulting configuration was the same. This still has to be investigated!
- File optimal_test_120413 added