Reduce Memory Transfer for SU3-Matrices (Feature #236)
Using the techniques described in 0911.3191, one can represent SU3 matrices with fewer than 9 complex / 18 real numbers.
This of course introduces more computational effort. However, on GPUs, for example, the kernels are usually bandwidth-limited, so the reduced memory traffic can lead to serious speedups.
The following variants exist (the names are not final):
- GM: The minimal representation uses 8 real numbers, each being the prefactor of one su3 generator (the Gell-Mann matrices). However, the paper above states that the computational overhead is much higher than the bandwidth gain.
- REC12 omits one row and stores the other two (6 complex / 12 real numbers); the omitted row is recomputed from the stored ones. No computational problems are expected here.
- REC10 omits four entries and stores the other five (5 complex / 10 real numbers). No computational problems are expected here.
- REC8 saves two more real numbers compared to REC10 by storing only the phases of 2 entries (4 complex / 8 real numbers). One can run into problems if entry 00 of the su3 matrix has an absolute value of 1 and if one uses half precision.
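The REC12 reconstruction can be sketched in plain C as follows. This is only an illustration, not the code from the repository: the type `cplx` and the function `rec12_reconstruct` are hypothetical names, and double precision is assumed. It uses the fact that for an SU3 matrix the third row is the complex conjugate of the cross product of the first two rows.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal complex type; the project's own type may differ. */
typedef struct { double re, im; } cplx;

static cplx cmul(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

static cplx csub(cplx a, cplx b)
{
    cplx r = { a.re - b.re, a.im - b.im };
    return r;
}

static cplx cconj(cplx a)
{
    cplx r = { a.re, -a.im };
    return r;
}

/* Overwrite row 2 of u with conj(row0 x row1).
 * Rows 0 and 1 must already hold the 6 stored complex numbers. */
static void rec12_reconstruct(cplx u[3][3])
{
    assert(u != NULL);
    u[2][0] = cconj(csub(cmul(u[0][1], u[1][2]), cmul(u[0][2], u[1][1])));
    u[2][1] = cconj(csub(cmul(u[0][2], u[1][0]), cmul(u[0][0], u[1][2])));
    u[2][2] = cconj(csub(cmul(u[0][0], u[1][1]), cmul(u[0][1], u[1][0])));
}
```

The extra cost per link is 6 complex multiplications and 3 subtractions, which is small compared to loading 6 additional real numbers from global memory.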
In general, implementing such a feature should only concern functions that explicitly deal with su3 matrices (operations_matrix_su3.cl, operations_su3vec.cl, ...), which makes it a tedious rather than a difficult task. In addition, one has to be careful when copying the gaugefield from the host to a device. Here, a conversion to the new format should be provided!
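A host-side conversion for REC12 could look roughly like the sketch below. It assumes that each link is stored as 18 consecutive reals in row-major order with interleaved real and imaginary parts; the actual host layout may differ, and `rec12_pack_field` is a hypothetical name, not an existing function.

```c
#include <assert.h>
#include <stddef.h>

#define FULL_REALS  18  /* 3x3 complex entries per link */
#define REC12_REALS 12  /* first two rows only */

/* Copy the first two rows of every link from the full host buffer
 * into a packed buffer that could then be transferred to the device.
 * Assumed layout: row-major, re/im interleaved, links stored contiguously. */
static void rec12_pack_field(const double *full, double *packed, size_t n_links)
{
    assert(full != NULL && packed != NULL);
    for (size_t l = 0; l < n_links; ++l) {
        for (size_t k = 0; k < REC12_REALS; ++k) {
            packed[l * REC12_REALS + k] = full[l * FULL_REALS + k];
        }
    }
}
```

Because the first two rows are contiguous in this assumed layout, packing is a plain strided copy; a layout that interleaves links for coalesced device access would need a different index mapping.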
So far, REC12 has been implemented and tested. It was removed again in b4c6a5cc in order to keep the code more clearly structured. Putting it back in should not be difficult.
The REC10/8 method is described in the paper above. I worked through the steps (see Mathematica file). There is one case which is potentially dangerous. Also, it is not trivial to obtain the phase of a complex number, which is needed for REC8.
With bf51976dd50 I added a working version of REC12 (see attached logfile).
This has been achieved by modifying the getSU3 function from operations_gaugefield to read only 6 of the 9 su3matrix elements.
REC12 gives a speedup of about 10% to the dslash, which results in a total speedup of about 4% in the inverter.
NOTE: For a thorough implementation of REC12, one should also modify the gaugefield on the device, i.e. store only the 6 elements, not all 9.
Then, one will also benefit in terms of available memory. However, this may be postponed due to the current infrastructure changes connected with multiple GPU usage.
I also tried out REC10, which can be implemented the same way as REC12 but needs more computations. It seems that the compiler cannot handle my current code: NaNs occur. I did not pursue this for now; I think one has to build a standalone test to get a working implementation. The same will certainly be true for REC8.
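To get a feeling for the memory that would be saved, here is a small back-of-the-envelope helper. It assumes 4 links per lattice site and no padding; `gaugefield_bytes` is a hypothetical name and the lattice size in the usage note is only an example.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of a gaugefield buffer.
 * Assumes 4 links per site (one per direction) and no padding. */
static size_t gaugefield_bytes(size_t n_sites, size_t reals_per_link,
                               size_t bytes_per_real)
{
    assert(bytes_per_real > 0);
    return n_sites * 4 * reals_per_link * bytes_per_real;
}
```

For a 16^3 x 32 lattice (131072 sites) in double precision this gives 75497472 bytes (72 MiB) with 18 reals per link and 50331648 bytes (48 MiB) with 12 reals per link, i.e. one third less.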
- % Done changed from 40 to 80
- File optimal_test_120904 added
- Status changed from New to In Progress