# Reduce Memory Transfer for SU3-Matrices (Feature #236)

**Description**

Using the techniques described in arXiv:0911.3191, one can represent SU3 matrices with fewer than 9 complex / 18 real numbers.

This of course introduces more computational effort. However, on GPUs, for example, the problems are usually bandwidth-limited, so the reduced memory traffic can lead to serious speedups.

The following variants exist (the names are not definitive):

- GM: The minimal representation uses 8 real numbers, each of which is the prefactor of one su3 generator (the Gell-Mann matrices). However, the paper above states that the computational overhead is much higher than the bandwidth gain.
- REC12 drops one row and stores the remaining 6 complex / 12 real numbers. Here, there should not be any computational problems.
- REC10 drops four entries and stores the remaining 5 complex / 10 real numbers. Here, there should not be any computational problems.
- REC8 drops two more real numbers compared to REC10 by storing only the phases of 2 entries (8 real numbers in total, equivalent to 4 complex numbers). One can run into problems if entry 00 of the su3 matrix has an absolute value of 1 and if one uses half precision.
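For REC12, the row that is not stored can be recovered from unitarity and det U = 1: the third row is the complex conjugate of the cross product of the first two rows. A minimal Python sketch of this idea follows. The function names `compress_rec12`, `reconstruct_rec12` and `example_su3` are made up for illustration and do not exist in the code base.

```python
import cmath
import math


def compress_rec12(u):
    """Keep only the first two rows (6 complex numbers) of a 3x3 matrix."""
    return [list(u[0]), list(u[1])]


def reconstruct_rec12(rows):
    """Rebuild the full matrix: third row = conj(first row x second row)."""
    a, b = rows
    c = [
        (a[1] * b[2] - a[2] * b[1]).conjugate(),
        (a[2] * b[0] - a[0] * b[2]).conjugate(),
        (a[0] * b[1] - a[1] * b[0]).conjugate(),
    ]
    return [list(a), list(b), c]


def example_su3(theta, alpha, beta):
    """Hypothetical test matrix: diagonal phases with unit determinant
    multiplied by a real rotation in the 0-1 plane, which is in SU(3)."""
    c, s = math.cos(theta), math.sin(theta)
    rot = [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]
    phases = [
        cmath.exp(1j * alpha),
        cmath.exp(1j * beta),
        cmath.exp(-1j * (alpha + beta)),
    ]
    return [[phases[i] * rot[i][j] for j in range(3)] for i in range(3)]
```

Usage: `reconstruct_rec12(compress_rec12(example_su3(0.3, 0.7, -1.1)))` should reproduce the input matrix up to rounding. The extra cost per matrix is a handful of complex multiplications, which fits the expectation that REC12 causes no computational problems.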

In general, implementing such a feature should only affect functions that deal explicitly with su3 matrices (operations_matrix_su3.cl, operations_su3vec.cl, ...), making it a tedious rather than difficult task. In addition, one has to be careful when copying the gaugefield from the host to a device: a conversion to the new format has to be provided there!
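The host-side conversion could look roughly like the following Python sketch for REC12. It assumes a simple layout (matrix after matrix, real part followed by imaginary part); the layout actually used on the device may well differ, and `pack_rec12`, `buffer_bytes` and `REALS_PER_LINK` are hypothetical names, not existing functions.

```python
# Real numbers stored per SU3 matrix, taken from the list of variants above.
REALS_PER_LINK = {"full": 18, "REC12": 12, "REC10": 10, "REC8": 8}


def pack_rec12(gaugefield):
    """Flatten a list of 3x3 complex matrices into 12 reals per matrix.

    Only the first two rows are written; the third row is not stored.
    """
    flat = []
    for u in gaugefield:
        for row in (u[0], u[1]):
            for z in row:
                flat.extend((z.real, z.imag))
    return flat


def buffer_bytes(n_links, fmt, bytes_per_real=8):
    """Size of the gaugefield buffer for a given format and precision."""
    return n_links * REALS_PER_LINK[fmt] * bytes_per_real
```

For example, a 16^3 x 8 lattice has 4 * 16**3 * 8 = 131072 links. In double precision, `buffer_bytes(131072, "full")` gives 18874368 bytes, while `buffer_bytes(131072, "REC12")` gives 12582912 bytes, i.e. two thirds of the full size.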

So far, REC12 has been implemented and tested. It was removed again in b4c6a5cc in order to have more clearly structured code. Putting it back in should not be difficult.

The REC10/8 method is described in the paper above. I worked through the steps (see the Mathematica file). There is one case which is potentially dangerous. Also, obtaining the phase of a complex number, which is needed for REC8, is not trivial.
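To illustrate the phase issue, here is a small Python sketch. It assumes that the first row (a1, a2, a3) is stored as the phase of a1 plus the complex entries a2 and a3; this layout is an assumption for illustration only, so check the paper and the Mathematica file for the actual scheme. The names `phase`, `restore_a1` and `first_row_rest` are hypothetical.

```python
import cmath
import math


def phase(z):
    """Phase of z in (-pi, pi]; atan2 picks the correct quadrant."""
    return math.atan2(z.imag, z.real)


def first_row_rest(a2, a3):
    """|a2|^2 + |a3|^2, which equals 1 - |a1|^2 for a unit-norm row."""
    return abs(a2) ** 2 + abs(a3) ** 2


def restore_a1(theta, a2, a3):
    """Rebuild a1 = |a1| * exp(i*theta) from the unit norm of the row."""
    # Clamp at zero so that rounding errors cannot produce sqrt(negative).
    modulus = math.sqrt(max(0.0, 1.0 - first_row_rest(a2, a3)))
    return cmath.rect(modulus, theta)
```

If |a1| = 1, the unit norm of the row forces a2 = a3 = 0 and `first_row_rest` returns exactly zero. Any later reconstruction step that divides by this quantity (again an assumption about the scheme) would then break down, and values close to this limit are likely to suffer most in low precision.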

### History

#### Updated by Christopher Pinke almost 7 years ago

**File** deleted (*REC10andREC8.nb*)

#### Updated by Christopher Pinke almost 7 years ago

**File** REC12andREC10andREC8.nb added

#### Updated by Christopher Pinke almost 7 years ago

With bf51976dd50 I added a working version of REC12 (see attached logfile).

This has been achieved by modifying the getSU3 function from operations_gaugefield to read only 6 of the 9 su3matrix elements.

REC12 gives a speedup of about 10% to the dslash, which results in a total speedup of about 4% in the inverter.

NOTE: For a proper implementation of REC12, one should also modify the gaugefield on the device, i.e. only store the 6 elements, not all 9.

Then, one will also benefit in terms of available memory. However, this may be postponed due to the current infrastructure changes connected with multiple GPU usage.

I also tried out REC10, which can be implemented the same way as REC12 but needs more computations. It seems that the compiler cannot handle my current code: NaNs occur. I did not pursue this for now; I think one has to build a standalone test to get a working implementation. The same will certainly be true for REC8.

- **% Done** changed from *40* to *80*
- **File** optimal_test_120904 added
- **Status** changed from *New* to *In Progress*