Optimize D_KS even-odd staggered (Feature #563)
This feature is closely related to #562, and I would give it higher priority. The problem is that the staggered formulation performs worse than the Wilson case.
|related to CL2QCD - Feature #562: Optimize CG-M multi-shifted inverter||Done||11 Dec 2013|
|related to CL2QCD - Feature #733: Profile the rhmc executable||In Progress||20 May 2015|
Rearranged code in operations_staggered.cl file (better performance now).
So far, the performance has been improved by rearranging some code in the operations_staggered.cl file.
To give an idea, on a 24^4 lattice one gets:
[15:09:28] INFO: Perform DKS_eo (EVEN + ODD) 2000 times.
[15:09:33] INFO: D_KS performance: 63.9937 GFLOPS
[15:09:33] INFO: D_KS memory: 177.835 GB/S
[15:09:33] INFO: Measured TIME: 5910.34msec
- % Done changed from 0 to 20
After a long time, I decided to come back to this issue. At the moment, the performance of the staggered D_KS kernel on a 24^4 lattice is:
Measured performance of 62.5616 GFLOPS
Measured memory of 173.855 GB/S
Measured time of 6045.64 msec
basically the same as one year ago. I will now try to improve it. A direct comparison with the Wilson code is not possible (neither the time nor the GFLOPS can be compared). Nevertheless, one should be able to reach the same fraction of the GPU's maximum memory bandwidth. I will start by changing the local size used to enqueue the kernel.
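As a rough illustration of what varying the local size involves on the host side, here is a minimal sketch. It assumes one work-item per lattice site of a given parity, which may not be how CL2QCD actually chooses its global size; the names `eo_sites`, `round_up_global_size`, `LaunchConfig` and `sweep_configs` are hypothetical and not taken from the code base. The only OpenCL fact it relies on is that, in OpenCL 1.x, the global work size passed to `clEnqueueNDRangeKernel` must be a multiple of the local work size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from CL2QCD): number of sites of one parity on an
// extent^4 lattice, since the even-odd splitting halves the lattice.
std::size_t eo_sites(std::size_t extent)
{
    return extent * extent * extent * extent / 2;
}

// Hypothetical helper: smallest multiple of local_size that is >= n_items.
// OpenCL 1.x requires the global work size to be a multiple of the local one.
std::size_t round_up_global_size(std::size_t n_items, std::size_t local_size)
{
    assert(local_size > 0);
    return ((n_items + local_size - 1) / local_size) * local_size;
}

// One candidate launch configuration for the sweep.
struct LaunchConfig {
    std::size_t local_size;
    std::size_t global_size;
};

// Build one launch configuration per candidate local size.
std::vector<LaunchConfig> sweep_configs(std::size_t n_items,
                                        const std::vector<std::size_t>& local_sizes)
{
    std::vector<LaunchConfig> configs;
    for (std::size_t ls : local_sizes) {
        configs.push_back({ls, round_up_global_size(n_items, ls)});
    }
    return configs;
}
```

For a 24^4 lattice, `sweep_configs(eo_sites(24), {8, 16, 32, 64, 128})` yields five configurations whose `global_size` and `local_size` would then be passed as the `global_work_size` and `local_work_size` arguments of `clEnqueueNDRangeKernel`, with the benchmark timed once per configuration.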
Still using the default parameters, I ran the benchmark executable on a 24^4 lattice with 2000 benchmark steps, varying the local size (ls) used for the D_KS_eo kernel.
ls    GFLOPS   GB/s     TIME(ms)
8     30.859    85.75   12256.70
16    57.437   159.62    6585.04
32    74.711   207.62    5062.54
64    77.009   214.00    4911.43
128   59.599   165.62    6346.24
Clearly, a local size of 64 performs best on the AMD Radeon HD 7970, with 32 close behind, while 8, 16 and 128 are much slower. Moreover, this card has a maximum memory bandwidth of 264 GB/s, so the measured 214.00 GB/s means that we reach over 80% of it (214.00 / 264 ≈ 81%). This is exactly the same performance as reached by the Wilson code!
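The bandwidth comparison can be reproduced with a few lines of C++. This is only a sketch of the arithmetic: the measurements are copied from the local-size benchmark above and the 264 GB/s peak is the value quoted for the HD 7970, while the names `Measurement`, `benchmark_table`, `bandwidth_fraction` and `best_by_bandwidth` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One benchmark run; the values used below come from the runs reported above.
struct Measurement {
    int local_size;
    double gflops;
    double gb_per_s;
    double time_ms;
};

// Measurements for local sizes 8 to 128 (24^4 lattice, 2000 benchmark steps).
std::vector<Measurement> benchmark_table()
{
    return {
        {8, 30.859, 85.75, 12256.70},
        {16, 57.437, 159.62, 6585.04},
        {32, 74.711, 207.62, 5062.54},
        {64, 77.009, 214.00, 4911.43},
        {128, 59.599, 165.62, 6346.24},
    };
}

// Fraction of the peak memory bandwidth that a measured bandwidth reaches.
double bandwidth_fraction(double measured_gb_per_s, double peak_gb_per_s)
{
    assert(peak_gb_per_s > 0.0);
    return measured_gb_per_s / peak_gb_per_s;
}

// Return the run with the highest measured bandwidth (runs must be non-empty).
Measurement best_by_bandwidth(const std::vector<Measurement>& runs)
{
    assert(!runs.empty());
    Measurement best = runs.front();
    for (const Measurement& m : runs) {
        if (m.gb_per_s > best.gb_per_s) {
            best = m;
        }
    }
    return best;
}
```

With a peak of 264 GB/s, `bandwidth_fraction` gives roughly 0.81 for local size 64 and roughly 0.79 for local size 32, and `best_by_bandwidth(benchmark_table())` selects the run with local size 64.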