Speed up inverter (Feature #717)


Added by Christopher Pinke almost 3 years ago. Updated almost 3 years ago.


Status: In Progress
Start date: 13 Nov 2014
Priority: Normal
Due date:
Assignee: Francesca Cuteri
% Done: 10%
Category: -
Target version: -

Associated revisions

Revision f1057a49
Added by Francesca Cuteri almost 3 years ago

saxpy and gamma5 merged
refs #726, #717

History

Updated by Christopher Pinke almost 3 years ago

I created an approximately thermalized configuration (conf.00149) on a 24^3*8 lattice (pure Wilson gauge/fermion action, kappa=0.165, beta=5.3).

A call like

./inverter \
--enable_profiling=true --measure_correlators=false --measure_pbp=true --num_sources=1 --solver=cg \
--sourcefile=conf.00149 --ns=24 --nt=8 --startcondition=continue --log-level=info --sourcetype=volume \
--sourcecontent=gaussian --kappa=0.165 --beta=5.3

prints the profiling information obtained during one evaluation of the chiral condensate, provided that the inverter executable is compiled in "debug" mode. I also switched off "async_halo_update", since we are working on one device only.

At first glance, the profiling indicates that gamma5 and saxpy are currently the worst-performing kernels in terms of bandwidth (BW) usage. After the dslash, they are also the kernels that take the most time.

  • % Done changed from 0 to 10

Updated by Christopher Pinke almost 3 years ago

Francesca, please check if the things I stated above work for you.

After that, you could try the following things:
  • Find out what the different compile options do (to the fermion functions).
  • Check how Flops and BW are calculated in the program (and perhaps recalculate the numbers).
  • Find out what the options "--use_merge_kernels_spinor" and "--use_merge_kernels_fermion" do, and try to understand their (potential) benefits. There are also tests for these options, but they were/are failing on GPUs.
  • Assignee changed from Christopher Pinke to Francesca Cuteri
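
As a starting point for the bandwidth check and for the question of what merging kernels could gain, here is a rough traffic model for a saxpy followed by a gamma5. It is only a sketch under stated assumptions: double-precision spinors with 12 complex components per site, an even-odd split of the 24^3*8 lattice, and a gamma5 that reads and writes the whole field. The function names are hypothetical and do not come from the code base, and the scalar coefficient of the saxpy is neglected.

```python
# Rough memory-traffic model: saxpy followed by gamma5, separate vs. merged.
# All names and the data layout are illustrative assumptions, not project code.

COMPLEX_PER_SPINOR = 12   # 4 spin x 3 colour components per lattice site
BYTES_PER_COMPLEX = 16    # assumption: double precision (2 x 8 bytes)
BYTES_PER_SPINOR = COMPLEX_PER_SPINOR * BYTES_PER_COMPLEX


def eo_sites(ns, nt):
    """Number of sites of one parity on an ns^3 x nt lattice."""
    return ns**3 * nt // 2


def traffic_bytes(field_transfers, sites):
    """Bytes moved when this many whole spinor fields are read or written."""
    return field_transfers * sites * BYTES_PER_SPINOR


def separate_kernels(sites):
    # saxpy: read x, read y, write out   -> 3 field transfers
    # gamma5: read out, write it back    -> 2 field transfers
    return traffic_bytes(3 + 2, sites)


def merged_kernel(sites):
    # gamma5 is applied before the single write -> 3 field transfers
    return traffic_bytes(3, sites)


sites = eo_sites(24, 8)
sep = separate_kernels(sites)
mer = merged_kernel(sites)
print(f"separate: {sep / 1e6:.1f} MB, merged: {mer / 1e6:.1f} MB, "
      f"ratio: {mer / sep:.2f}")
```

Under these assumptions the merged kernel moves three fields instead of five. Whether the real kernels behave this way has to be checked against the profiling output.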

Updated by Francesca Cuteri almost 3 years ago

The performance of the inverter was tested with the setup given above (#717) and the code as in 2d39a265.
The numbers below are those reported after 2000 iterations.

With "--use_merge_kernels_fermion=false" one gets:

[16:25:16] INFO:     SOLVER [CG] [002000]:    CG completed 2000 iterations in 12227 ms
[16:25:16] INFO:     SOLVER [CG] [002000]:    Performance [FLOPS]: 65.401 GFlops. Performance after warmup: 68.811 Gflops.
[16:25:16] INFO:     SOLVER [CG] [002000]:    Performance [BANDWIDTH]. 145.218 GB/s. Performance after warmup: 152.826 GB/s.

With "--use_merge_kernels_fermion=false" and a specific size of 64 for the "saxpy_eoprec" kernel one gets:

[16:27:16] INFO:     SOLVER [CG] [002000]:    CG completed 2000 iterations in 11992 ms
[16:27:16] INFO:     SOLVER [CG] [002000]:    Performance [FLOPS]: 66.68 GFlops. Performance after warmup: 70.524 Gflops.
[16:27:16] INFO:     SOLVER [CG] [002000]:    Performance [BANDWIDTH]. 148.058 GB/s. Performance after warmup: 156.63 GB/s.

With "--use_merge_kernels_fermion=true" one gets:

[16:28:47] INFO:     SOLVER [CG] [002000]:    CG completed 2000 iterations in 11759 ms
[16:28:47] INFO:     SOLVER [CG] [002000]:    Performance [FLOPS]: 68.005 GFlops. Performance after warmup: 74.688 Gflops.
[16:28:47] INFO:     SOLVER [CG] [002000]:    Performance [BANDWIDTH]. 151 GB/s. Performance after warmup: 165.879 GB/s.

With "--use_merge_kernels_fermion=true" and a specific size of 64 for the "saxpy_AND_gamma5_eo" kernel one gets:

[16:30:12] INFO:     SOLVER [CG] [002000]:    CG completed 2000 iterations in 11590 ms
[16:30:12] INFO:     SOLVER [CG] [002000]:    Performance [FLOPS]: 68.994 GFlops. Performance after warmup: 76.625 Gflops.
[16:30:12] INFO:     SOLVER [CG] [002000]:    Performance [BANDWIDTH]. 153.196 GB/s. Performance after warmup: 170.181 GB/s.

With "--use_merge_kernels_fermion=true" and a specific size of 64 for both the "saxpy_eoprec" and "saxpy_AND_gamma5_eo" kernels one gets:

[16:31:25] INFO:     SOLVER [CG] [002000]:    CG completed 2000 iterations in 10923 ms
[16:31:25] INFO:     SOLVER [CG] [002000]:    Performance [FLOPS]: 73.207 GFlops. Performance after warmup: 78.797 Gflops.
[16:31:25] INFO:     SOLVER [CG] [002000]:    Performance [BANDWIDTH]. 162.551 GB/s. Performance after warmup: 175.004 GB/s.

Note that in 2d39a265 the specific size of 64 for the "saxpy_AND_gamma5_eo" kernel is set in "fermions.cpp", while the size for "saxpy_eoprec" is not specified in "spinors.cpp". To reproduce the performance figures above, one should take this into account and modify the code accordingly.
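
To compare the runs quantitatively, the solver lines can be parsed and the wall-time speedup computed. The sketch below uses only the first and the last run reported above, copied verbatim. The regular expressions and the helper name parse_runs are assumptions made for this example and are not part of the inverter.

```python
import re

# Solver output of the first (unmerged) and last (merged, size 64 for both
# kernels) run, copied from the logs above.
LOG = """\
[16:25:16] INFO:     SOLVER [CG] [002000]:    CG completed 2000 iterations in 12227 ms
[16:25:16] INFO:     SOLVER [CG] [002000]:    Performance [FLOPS]: 65.401 GFlops. Performance after warmup: 68.811 Gflops.
[16:31:25] INFO:     SOLVER [CG] [002000]:    CG completed 2000 iterations in 10923 ms
[16:31:25] INFO:     SOLVER [CG] [002000]:    Performance [FLOPS]: 73.207 GFlops. Performance after warmup: 78.797 Gflops.
"""

TIME_RE = re.compile(r"completed (\d+) iterations in (\d+) ms")
FLOPS_RE = re.compile(r"Performance \[FLOPS\]: ([\d.]+) GFlops")


def parse_runs(log):
    """Return one (milliseconds, gflops) pair per run found in the log text."""
    times = [int(m.group(2)) for m in TIME_RE.finditer(log)]
    gflops = [float(m.group(1)) for m in FLOPS_RE.finditer(log)]
    return list(zip(times, gflops))


(base_ms, base_gflops), (best_ms, best_gflops) = parse_runs(LOG)

# Wall-time ratio between the slowest and the fastest configuration.
speedup = base_ms / best_ms

# GFlops times seconds gives the total work counted by the program, in GFlop.
base_work = base_gflops * base_ms / 1000
best_work = best_gflops * best_ms / 1000

print(f"speedup: {speedup:.3f}")
print(f"counted work: {base_work:.1f} vs {best_work:.1f} GFlop")
```

The two runs differ by roughly 12% in wall time, while the counted work (GFlops times seconds) is nearly identical. This suggests that the program counts the same number of flops in both cases, so the higher GFlops figure simply reflects the shorter run time.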
