Optimize D_KS even-odd staggered (Feature #563)


Added by Alessandro Sciarra over 5 years ago. Updated about 4 years ago.


Status: In Progress
Start date: 12 Dec 2013
Priority: Normal
Due date:
Assignee: Alessandro Sciarra
% Done: 20%
Category: -
Target version: -

Description

This feature is closely related to #562 (and I would say it has higher priority than that one). The problem is that the staggered formulation performs worse than the Wilson case.


Related issues

related to CL2QCD - Feature #562: Optimize CG-M multi-shifted inverter (Done, 11 Dec 2013)
related to CL2QCD - Feature #733: Profile the rhmc executable (In Progress, 20 May 2015)

Associated revisions

Revision c116b3f8
Added by Alessandro Sciarra over 5 years ago

Added run time as output to dslash_multidev.cpp file.
refs #563

Revision f331e133
Added by Alessandro Sciarra over 5 years ago

Moved a comment on the right line of code.
refs #563

Revision 504ba3c5
Added by Alessandro Sciarra over 5 years ago

Rearranged code in operations_staggered.cl file (better performance now).
refs #563

Revision 47443874
Added by Alessandro Sciarra about 4 years ago

Improved performance D_KS_eo kernel varying the local size value.
refs #563 @30m

History

Updated by Alessandro Sciarra over 5 years ago

So far, the performance has been improved by rearranging some code in the operations_staggered.cl file.
Just to give an idea, on a 24^4 lattice one gets:

[15:09:28] INFO: Perform DKS_eo (EVEN + ODD) 2000 times.
[15:09:33] INFO: D_KS performance: 63.9937 GFLOPS
[15:09:33] INFO: D_KS memory: 177.835 GB/S
[15:09:33] INFO: Measured TIME: 5910.34msec

  • % Done changed from 0 to 20

Updated by Alessandro Sciarra over 5 years ago

  • Status changed from New to In Progress

Updated by Alessandro Sciarra about 4 years ago

After a long period of time, I decided to come back to this issue. At the moment the performance of the staggered DKS kernel on a 24^4 lattice is:

Measured performance of 62.5616 GFLOPS
Measured memory of 173.855 GB/S
Measured time of 6045.64 msec

basically the same as one year ago. I will now try to increase it. A direct comparison with the Wilson code is not possible (one cannot compare the time or the GFLOPS). Nevertheless, one should be able to reach the same fraction of the GPU's maximum memory bandwidth. I will start by changing the local size used to enqueue the kernel.
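To make the bandwidth argument concrete, one can divide the reported GFLOPS by the reported GB/s to get the number of floating-point operations per byte moved, as seen by the benchmark. The following is only a minimal sketch in Python. The function name `arithmetic_intensity` is hypothetical, and the inputs are the two measurements quoted in this issue.

```python
def arithmetic_intensity(gflops: float, gb_per_s: float) -> float:
    """Floating-point operations per byte moved, from benchmark output."""
    return gflops / gb_per_s


# (GFLOPS, GB/s) pairs quoted in this issue for the 24^4 lattice.
measurements = {
    "first measurement": (63.9937, 177.835),
    "one year later": (62.5616, 173.855),
}

intensities = {
    label: arithmetic_intensity(gflops, bandwidth)
    for label, (gflops, bandwidth) in measurements.items()
}

for label, value in intensities.items():
    print(f"{label}: {value:.3f} flop/byte")
```

Both runs give about 0.36 flop/byte, so in this benchmark the GFLOPS figure is proportional to the achieved bandwidth. Assuming the kernel is limited by memory traffic, which this low ratio suggests, the fraction of peak bandwidth is the natural figure of merit.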

Updated by Alessandro Sciarra about 4 years ago

All the performance figures reported above were measured using the default parameters (beta, mass, etc.).

Updated by Alessandro Sciarra about 4 years ago

Again using the default parameters, I ran the benchmark executable on a 24^4 lattice with 2000 benchmark steps, varying the local size (ls) used in the D_KS_eo kernel.

ls      GFLOPS     GB/s     TIME(ms)
8       30.859     85.75   12256.70
16      57.437    159.62    6585.04
32      74.711    207.62    5062.54
64      77.009    214.00    4911.43
128     59.599    165.62    6346.24

It is clear that a local size of 64 performs best among the tested values on the AMD Radeon HD 7970. Moreover, this card has a maximum memory bandwidth of 264 GB/s, which means that we reach over 80% of it (214.00 / 264 ≈ 81%). This is the same fraction of the peak bandwidth reached in the Wilson code!
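The bandwidth fraction for each tested local size can be checked with a few lines of Python. This is just a sketch: the names `PEAK_BANDWIDTH_GB_S`, `measured_bandwidth` and `fraction_of_peak` are hypothetical, the peak value is the 264 GB/s quoted above, and the measured values are the GB/s column of the benchmark for local sizes 8 to 128.

```python
# Peak memory bandwidth quoted in this issue for the AMD Radeon HD 7970.
PEAK_BANDWIDTH_GB_S = 264.0

# local size -> measured GB/s, taken from the benchmark results in this issue.
measured_bandwidth = {8: 85.75, 16: 159.62, 32: 207.62, 64: 214.00, 128: 165.62}


def fraction_of_peak(gb_per_s: float, peak: float = PEAK_BANDWIDTH_GB_S) -> float:
    """Achieved bandwidth as a fraction of the card's peak bandwidth."""
    return gb_per_s / peak


fractions = {ls: fraction_of_peak(bw) for ls, bw in measured_bandwidth.items()}

# Local size with the highest achieved fraction of peak bandwidth.
best_local_size = max(fractions, key=fractions.get)

for ls, frac in fractions.items():
    print(f"local size {ls:>3}: {100 * frac:5.1f}% of peak")
print(f"best local size: {best_local_size}")
```

With these numbers the best entry is local size 64 at about 81% of peak, while local size 32 comes second at just under 79%.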
