HMC test against tmlqcd (Benchmark #271)


Added by Christopher Pinke over 7 years ago. Updated over 7 years ago.


Status: In Progress
Start date: 14 Mar 2012
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: 2012.2
Estimated time: 5.00 hours

Description

To establish a first performance baseline for the HMC, it should be tested against tmlqcd.

24^4 is the largest lattice on which the dslash performs well, so this is the size that should be tested.

As a first test, we should just use an unimproved gauge action and a heavy mass, starting from a cold configuration (the input below is similar to a 4^4 sample input from tmlqcd):

L=24
T=24
Measurements = 10
StartCondition = Cold
2KappaMu = .177
kappa = .177
NSave = 500000
ThetaT = 1.
BCGstabMaxIter = 10000
CGMaxIter = 10000
UseEvenOdd = yes
ReversibilityCheck = no
ReversibilityCheckIntervall = 100
InitialStoreCounter = 0
DebugLevel = 0

BeginMonomial GAUGE
Type = wilson
beta = 6.
Timescale = 0
EndMonomial

BeginMonomial DET
Timescale = 1
2KappaMu = .177
kappa = .177
AcceptancePrecision = 1.e-23
ForcePrecision = 1.e-12
Name = det
#solver = BiCGStab
solver = CG
EndMonomial

BeginIntegrator
Type0 = 2MN
Type1 = 2MN
IntegrationSteps0 = 5
IntegrationSteps1 = 8
tau = 0.5
NumberOfTimescales = 2
EndIntegrator

The 10 HMC steps are useful so that one does not deal with the same random numbers again and again. However, the program should be run several times for statistics, and one should also check how the acceptance rate behaves; the parameters here may have to be modified (they were just a wild guess).

In addition, the runs should be performed on LOEWE on one node, i.e. 24 cores.
MPI can be used in up to 4 (lattice) dimensions. I would suggest parallelizing in 2 dimensions, (t and x) = (6,4), i.e. each process has a local 4 * 6 * 24 * 24 lattice.
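
The arithmetic behind this decomposition can be checked with a short sketch. The function name `local_lattice` and the dictionary layout are purely illustrative and are not part of tmlqcd; the only inputs taken from this ticket are the 24^4 lattice and the (t, x) = (6, 4) process grid.

```python
def local_lattice(global_dims, proc_grid):
    """Split each global extent evenly over the processes assigned to that direction."""
    local = {}
    for direction, extent in global_dims.items():
        nproc = proc_grid.get(direction, 1)  # directions not listed are not split
        if extent % nproc != 0:
            raise ValueError(
                f"{direction}: extent {extent} is not divisible by {nproc} processes"
            )
        local[direction] = extent // nproc
    return local


global_dims = {"t": 24, "x": 24, "y": 24, "z": 24}  # 24^4 lattice
proc_grid = {"t": 6, "x": 4}                         # suggested (t, x) = (6, 4)

# Total number of MPI processes is the product of the process-grid extents.
n_ranks = 1
for nproc in proc_grid.values():
    n_ranks *= nproc

local = local_lattice(global_dims, proc_grid)
print("MPI ranks:", n_ranks)
print("local lattice:", local)
```

The process grid uses 6 * 4 = 24 ranks, which matches the 24 cores of one node, and the local extents come out as 4 (t), 6 (x), 24 (y), 24 (z).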

If MPI is used, the time for each HMC trajectory is written to a file at the end of the run.


tmlqcd_job.slurm_reread (1.7 kB) Christopher Pinke, 16 Mar 2012 12:43 pm

tmlqcd_input_reread (1.4 kB) Christopher Pinke, 16 Mar 2012 12:43 pm

OPTIMAL_input_reread (661 Bytes) Christopher Pinke, 14 Apr 2012 12:09 pm

OPTIMAL_input_cold (635 Bytes) Christopher Pinke, 14 Apr 2012 12:09 pm

optimal_test_120413 (7.5 kB) Christopher Pinke, 17 Apr 2012 11:35 am


History

Updated by Christopher Pinke over 7 years ago

I performed some testing of the script and also ran a couple of tmlqcd runs on LOEWE (see attached input and slurm file).
In addition to the cold starts with 10 HMC steps, I also did one step from a given config.
I do not attach the config for the reread runs here since it is 183 MB.

The total runtimes were (statistics of ~10 runs):

  • cold start: 739 s
  • reread: 78 s

There were some problems when I tried to reproduce the cold start results with OPTIMAL: With the same parameters, OPTIMAL does not get to the same config at the end. At the same time, the HMC tests perform well. I think this will take some time.

However, starting from the config should be comparable even without the same result, because the same kind of steps are performed! At least for a first comparison (I attached a suitable input file).

Updated by Matthias Bach over 7 years ago

The configuration currently does not work on the GPUs in LOEWE. The problem is probably caused by the memory requirements. Tests on an HD7970 show the following memory requirements:
[10:49:51] [11:49:37] TRACE: Memory usage (Tahiti): 254803976 bytes - Maximum usage: 1656579312 - Host backed memory: 0
This is about 1.5 GiB, more than the HD5870 contains.

24^3 * 8 successfully completes on HD5870. Memory requirements are:
[11:23:27] TRACE: Memory usage (Cypress): 84934664 bytes - Maximum usage: 552425712 - Host backed memory: 0

24^3 * 10 successfully completes on HD5870. Memory requirements are:
[11:44:08] TRACE: Memory usage (Cypress): 106168328 bytes - Maximum usage: 690272496 - Host backed memory: 0

24^3 * 12 fails:
[11:50:03] FATAL: OpenCL failed. Error code -5 in clEnqueueCopyBuffer at /home/compeng/bach/QCD/clhmc/prog/opencl_module.cpp:393
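
The three "Maximum usage" figures in the logs above can be used to estimate how the peak memory grows with the temporal extent. The following is only a rough sketch: it assumes the peak usage scales linearly with T at fixed 24^3 spatial volume, which the three data points suggest but which nothing on this page confirms. The variable names are illustrative.

```python
# "Maximum usage" in bytes from the TRACE lines above, keyed by temporal extent T
# (spatial volume is 24^3 in all three cases).
max_usage = {8: 552425712, 10: 690272496, 24: 1656579312}

GIB = 2**30

# Peak bytes per time slice for each measurement.
per_slice = {t: usage / t for t, usage in max_usage.items()}
avg_per_slice = sum(per_slice.values()) / len(per_slice)

# Relative spread between the measurements; small if the scaling is linear in T.
spread = (max(per_slice.values()) - min(per_slice.values())) / avg_per_slice

# Extrapolation (assumption: linear scaling) to the failing 24^3 * 12 case.
estimate_t12 = 12 * avg_per_slice

for t, usage in sorted(max_usage.items()):
    print(f"T={t:2d}: {usage / GIB:.2f} GiB peak, {per_slice[t] / 1e6:.1f} MB per slice")
print(f"relative spread per slice: {spread:.1e}")
print(f"estimated peak for T=12: {estimate_t12 / GIB:.2f} GiB")
```

Under this assumption the per-slice cost is about 69 MB in all three runs, so the 24^3 * 12 case would peak somewhere above the 0.64 GiB of the successful 24^3 * 10 run. Error code -5 is `CL_OUT_OF_RESOURCES` in OpenCL, which is consistent with a memory limit on that card, although the exact limit is not determined here.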

Sadly, performance results for the HD7970 cannot be given either, as the GPU hangs if the hmc is built in release mode.

  • Status changed from New to In Progress

Updated by Matthias Bach over 7 years ago

The issue on the HD7970 seems to be specific to that GPU. On the V7800 the code completes with the following performance numbers:

[15:45:27] INFO: ## Program Parts:      total    perc
[15:45:27] INFO: ## Total:          65662291
[15:45:27] INFO: ## Init.:           4386644      6.7
[15:45:27] INFO: ## Perf.:          61274289     93.3
[15:45:27] INFO: Maximum memory used (Cypress): 1592874736 bytes

Updated by Christopher Pinke over 7 years ago

I performed additional tmlqcd runs for both test runs, extended to 2 and 3 nodes (48 and 72 cores, respectively).
I think it is interesting to compare these to our runtime as well. Perhaps a different parallelization might speed up tmlqcd here; I have not tried that yet!

The results are:

#nodes   cold    reread
1        739 s   78 s
2        411 s   45 s
3        299 s   33 s
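
To put these numbers in perspective, the speedup and parallel efficiency relative to the single-node run can be computed as follows. This is a small illustrative sketch; the helper name `scaling` is hypothetical, and the only inputs are the runtimes from the table above.

```python
# Runtimes in seconds from the table above: nodes -> (cold, reread)
runtimes = {1: (739, 78), 2: (411, 45), 3: (299, 33)}


def scaling(runtimes, column):
    """Return {nodes: (speedup, efficiency)} relative to the 1-node runtime."""
    base = runtimes[1][column]
    result = {}
    for nodes, times in sorted(runtimes.items()):
        speedup = base / times[column]
        result[nodes] = (speedup, speedup / nodes)
    return result


cold = scaling(runtimes, 0)
reread = scaling(runtimes, 1)

for label, data in (("cold", cold), ("reread", reread)):
    for nodes, (speedup, efficiency) in data.items():
        print(f"{label:6s} {nodes} node(s): speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

For the cold start this gives a parallel efficiency of roughly 90% on 2 nodes and roughly 82% on 3 nodes; the reread runs scale slightly worse.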

Updated by Christopher Pinke over 7 years ago

There were some mistakes in the input files quoted here. The most important difference is that I used 8 integration steps instead of 4 on timescale 1 (I corrected that above, too). In addition, the integrator type was not specified in the OPTIMAL input file, so it used "leapfrog" as the default, whereas tmlqcd uses "2mn".
With this changed, one has to expect an increase in runtime!

On the other hand, the difference in the integrator was also responsible for the difference in the results I observed! It vanishes once the same integrator is used (see attached log file).
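
Why the two integrators give different results and different costs can be illustrated with a toy model. The sketch below compares a leapfrog step with a second-order minimum-norm (2MN) step on a one-dimensional harmonic oscillator. It is not the code of either program: the function names, the toy Hamiltonian, the step count and the value of the 2MN parameter `LAMBDA` are all assumptions made for illustration, and the value used by tmlqcd or OPTIMAL is not stated on this page.

```python
def force(q):
    # Toy model: harmonic oscillator, H = p^2/2 + q^2/2, so F = -dH/dq = -q
    return -q


def hamiltonian(q, p):
    return 0.5 * p * p + 0.5 * q * q


def leapfrog(q, p, tau, n):
    """n leapfrog steps (kick-drift-kick) over trajectory length tau."""
    eps = tau / n
    for _ in range(n):
        p += 0.5 * eps * force(q)
        q += eps * p
        p += 0.5 * eps * force(q)
    return q, p


LAMBDA = 0.1931833275037836  # assumed value of the 2MN parameter (illustrative)


def two_mn(q, p, tau, n, lam=LAMBDA):
    """n second-order minimum-norm steps over trajectory length tau."""
    eps = tau / n
    for _ in range(n):
        p += lam * eps * force(q)
        q += 0.5 * eps * p
        p += (1.0 - 2.0 * lam) * eps * force(q)
        q += 0.5 * eps * p
        p += lam * eps * force(q)
    return q, p


q0, p0 = 1.0, 0.0
h0 = hamiltonian(q0, p0)
results = {}
for name, integrator in (("leapfrog", leapfrog), ("2MN", two_mn)):
    q1, p1 = integrator(q0, p0, 0.5, 8)
    dh = hamiltonian(q1, p1) - h0
    # Reversibility check: flip the momentum and integrate back.
    qb, pb = integrator(q1, -p1, 0.5, 8)
    results[name] = {"q": q1, "p": p1, "dH": dh, "rev_err": abs(qb - q0) + abs(pb + p0)}
    print(f"{name:8s} q={q1:.10f} p={p1:.10f} dH={dh:.3e} rev_err={results[name]['rev_err']:.1e}")
```

Both schemes are reversible, but they end at different points in phase space for the same parameters, so two programs using different integrators are not expected to produce the same final configuration. The 2MN step also needs more force evaluations per step than leapfrog (two instead of one when adjacent momentum updates are merged; the unmerged loops above call `force` three and two times per step, respectively), which is in line with the expected runtime increase.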

Note: Up to this point, an unimproved gauge action is used. After changing to an improved one, I observed a difference in dH, while the resulting configuration was the same. This still has to be investigated!

  • File optimal_test_120413 added

Updated by Christopher Pinke over 7 years ago

  • File deleted (OPTIMAL_input_reread)

Updated by Christopher Pinke over 7 years ago

Corrected optimal input file for reread and cold start.

Updated by Christopher Pinke over 7 years ago

  • File deleted (optimal_test_120413)

Updated by Christopher Pinke over 7 years ago

There was some confusion about the integrators used in tmlqcd. I updated my test log accordingly. With the unimproved gauge action both programs still agree!
