Merge saxpy and gamma5 (Feature #726)
Francesca, please start with the implementation.
To start, look at the existing tests in the file given above and also at the tests for the original kernels.
The tests for the merged kernel should more or less cover all of these test cases, too.
If you have any questions, we can look at the code together.
- Assignee changed from Christopher Pinke to Francesca Cuteri
- Status changed from New to In Progress
I tested the performance of the fermionmatrix with the setup given in #717.
I simply commented everything out of the solver except for the fermionmatrix application, and recorded the performance after 2000 iterations.
The result is that the fermionmatrix (which is QplusQminus in this case) performs at ~80 Gflops:
[11:39:32] INFO: SOLVER [CG] : CG completed in 9302 ms @ 81.055 Gflops. Performed 2001 iterations. Performance after warmup: 80.975 Gflops
However, the dslash alone achieves ~110 Gflops. The difference of 30 Gflops should be caused by gamma5, which manages a lousy 3 Gflops, and by the saxpy operation.
Given that the complete CG performs at ~70 Gflops for this setup, this seems to indicate that merging the saxpy and gamma5 kernels could indeed give a visible speedup to the inverter.
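To make the idea concrete, here is a minimal sketch of what a merged kernel would compute, written as plain host-side C++ rather than as the actual OpenCL kernel. Everything in it is an assumption for illustration: saxpy is taken as out = alpha*x + y (the sign convention in the code base may differ), gamma5 is taken as diag(1, 1, -1, -1) acting on the spin index, a spinor is stored as 12 complex numbers in spin-major order, and all names are hypothetical. The point is only that the merged version reads and writes the field once instead of twice.

```cpp
#include <array>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;
// One lattice site: 4 spin components x 3 colours, spin-major (assumed layout).
using Spinor = std::array<Complex, 12>;
using Field = std::vector<Spinor>;

// Unfused reference, pass 1: out = alpha * x + y (assumed saxpy convention).
void saxpy(Complex alpha, const Field& x, const Field& y, Field& out) {
    out.resize(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t k = 0; k < 12; ++k)
            out[i][k] = alpha * x[i][k] + y[i][k];
}

// Unfused reference, pass 2: flip the sign of the lower two spin components
// (assumed gamma5 = diag(1, 1, -1, -1)).
void gamma5(Field& f) {
    for (auto& site : f)
        for (std::size_t k = 6; k < 12; ++k)
            site[k] = -site[k];
}

// Merged version: out = gamma5 * (alpha * x + y) in a single pass,
// so each site is loaded and stored only once.
void saxpy_gamma5(Complex alpha, const Field& x, const Field& y, Field& out) {
    out.resize(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t k = 0; k < 12; ++k) {
            const Complex v = alpha * x[i][k] + y[i][k];
            out[i][k] = (k < 6) ? v : -v;
        }
}
```

A test for the merged kernel can then simply compare saxpy_gamma5 against saxpy followed by gamma5 on the same input, in addition to the cases the original kernels are already tested with.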
Actually, simply leaving out the gamma5 from the fermion matrix gives ~9 Gflops more, which could be the benefit of the merging (in case the merging works "perfectly"):
[11:47:30] INFO: SOLVER [CG] : CG completed in 8462 ms @ 89.102 Gflops. Performed 2001 iterations. Performance after warmup: 89.014 Gflops.
Still, this would mean that one is losing 20 Gflops compared to the dslash alone. In case the saxpy cannot be accelerated any further, one could then think about merging the gamma5 and saxpy operation with (1+dslash)...
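As a cross-check of the numbers in the two log lines, the small C++ sketch below turns them into time per iteration and a speedup factor. The struct and function names are hypothetical; only the figures are taken from the logs. The wall-time ratio and the Gflops ratio both come out at about 1.10, which suggests that the flop count is the same in both runs and that ~10% is the best case for the merging on the fermionmatrix alone.

```cpp
// Figures copied from the two CG log lines; names are illustrative only.
struct Run {
    double ms;       // total wall time reported by the solver
    int iterations;  // number of iterations performed
    double gflops;   // reported overall performance
};

constexpr Run with_gamma5{9302.0, 2001, 81.055};
constexpr Run without_gamma5{8462.0, 2001, 89.102};

// Average wall time of one iteration in milliseconds.
double ms_per_iteration(const Run& r) {
    return r.ms / r.iterations;
}

// Speedup of `faster` over `base`, measured by wall time.
double time_speedup(const Run& base, const Run& faster) {
    return base.ms / faster.ms;
}

// The same speedup, measured by the reported Gflops instead.
double gflops_ratio(const Run& base, const Run& faster) {
    return faster.gflops / base.gflops;
}
```

For example, ms_per_iteration(with_gamma5) - ms_per_iteration(without_gamma5) gives the time per iteration that is currently spent on the separate gamma5.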