Reduce Memory Transfer for SU3-Matrices (Feature #236)


Added by Christopher Pinke almost 8 years ago. Updated almost 7 years ago.


Status:In Progress Start date:10 Nov 2011
Priority:Normal Due date:
Assignee:- % Done:80%

Category:-
Target version:-

Description

Using the techniques described in arXiv:0911.3191, one can represent SU3 matrices with fewer than 9 complex / 18 real numbers.
This of course costs additional computation. However, on GPUs, for example, the kernels are usually bandwidth-limited, so the reduced memory traffic can lead to significant speedups.

The following variants exist (the names are not final):

  • GM: The minimal representation is 8 real numbers, each being the coefficient of one su3 generator (the Gell-Mann matrices). However, the paper above states that the computational overhead is much higher than the bandwidth gain.
  • REC12 omits one row and stores the other two (6 complex / 12 real numbers). Here, there should not be any computational problems.
  • REC10 omits four entries (5 complex / 10 real numbers). Here, there should not be any computational problems.
  • REC8 saves two more real numbers compared to REC10 by storing only the phases of 2 entries (4 complex / 8 real numbers). One can run into problems if entry 00 of the su3 matrix has an absolute value of 1 and if one uses half precision.
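For illustration, the REC12 reconstruction can be sketched in plain C as follows. This is a minimal sketch, not the code from the repository: the types cplx, su3_full and su3_rec12 and the function rec12_reconstruct are hypothetical names, and a row-major layout is assumed. It relies on the fact that for a unitary matrix with determinant 1 the third row equals the complex conjugate of the cross product of the first two rows.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type (OpenCL C has no built-in one). */
typedef struct { double re, im; } cplx;

/* Hypothetical layouts, row-major: full matrix and REC12 (first two rows). */
typedef struct { cplx e[3][3]; } su3_full;
typedef struct { cplx e[2][3]; } su3_rec12;

/* Returns conj(a*b - c*d), one component of the conjugated cross product. */
static cplx conj_diff_of_products(cplx a, cplx b, cplx c, cplx d)
{
    cplx r;
    r.re =   (a.re * b.re - a.im * b.im) - (c.re * d.re - c.im * d.im);
    r.im = -((a.re * b.im + a.im * b.re) - (c.re * d.im + c.im * d.re));
    return r;
}

/* Rebuild the full matrix: rows 0 and 1 are copied,
 * row 2 = conj(row 0 x row 1). Only valid for SU(3) input. */
su3_full rec12_reconstruct(const su3_rec12 *r)
{
    assert(r != NULL);
    su3_full u;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 3; ++j)
            u.e[i][j] = r->e[i][j];

    const cplx *a = r->e[0];
    const cplx *b = r->e[1];
    u.e[2][0] = conj_diff_of_products(a[1], b[2], a[2], b[1]);
    u.e[2][1] = conj_diff_of_products(a[2], b[0], a[0], b[2]);
    u.e[2][2] = conj_diff_of_products(a[0], b[1], a[1], b[0]);
    return u;
}
```

The extra cost per matrix is six complex multiplications and three complex subtractions, in exchange for loading 12 instead of 18 real numbers.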

In general, implementing such a feature should only affect functions that deal explicitly with su3 matrices (operations_matrix_su3.cl, operations_su3vec.cl, ...), making it a tedious rather than difficult task. In addition, one has to be careful when copying the gaugefield from the host to a device: a conversion to the new format must be provided!
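The host-side conversion mentioned above could look roughly like this for REC12. This is a sketch under assumptions: pack_gaugefield_rec12 is a hypothetical helper, and the host gaugefield is assumed to be a flat array with 18 consecutive reals per matrix (row-major, real and imaginary parts interleaved), which may not match the layout actually used in the code.

```c
#include <assert.h>
#include <stddef.h>

#define FULL_REALS  18u  /* 9 complex numbers per full SU3 matrix    */
#define REC12_REALS 12u  /* 6 complex numbers: the first two rows    */

/* Copy the first two rows of each of n_links matrices from src
 * (assumed layout: 18 reals per matrix, row-major, re/im interleaved)
 * into the compact buffer dst (12 reals per matrix).
 * dst must hold at least n_links * REC12_REALS doubles. */
void pack_gaugefield_rec12(const double *src, double *dst, size_t n_links)
{
    assert(src != NULL && dst != NULL);
    for (size_t l = 0; l < n_links; ++l)
        for (size_t k = 0; k < REC12_REALS; ++k)
            dst[l * REC12_REALS + k] = src[l * FULL_REALS + k];
}
```

The compact buffer is what would then be copied to the device, so both the transfer and the device-side storage shrink to two thirds of the full size.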

So far, REC12 has been implemented and tested. It was removed again in b4c6a5cc in order to have more clearly structured code. Putting it back in should not be difficult.

The REC10/8 method is described in the paper above. I worked through the steps (see Mathematica file). There is one case which is potentially dangerous. Also, it is not trivial to obtain the phase of a complex number, which is needed for REC8.


REC12andREC10andREC8.nb (24.7 kB) Christopher Pinke, 04 Sep 2012 02:00 pm

optimal_test_120904 (5.4 kB) Christopher Pinke, 05 Sep 2012 09:07 am


History

Updated by Christopher Pinke almost 7 years ago

  • File deleted (REC10andREC8.nb)

Updated by Christopher Pinke almost 7 years ago

With bf51976dd50 I added a working version of REC12 (see attached logfile).
This has been achieved by modifying the getSU3 function from operations_gaugefield to read only 6 of the 9 su3matrix elements.
REC12 gives a speedup of about 10% to the dslash, which results in a total speedup of about 4% in the inverter.

NOTE: For a thorough implementation of REC12, one should also modify the gaugefield on the device, i.e. only store the 6 elements, not all 9.
Then, one will also benefit in terms of available memory. However, this may be postponed due to current infrastructure changes connected with multiple GPU usage.

I also tried out REC10, which can be implemented the same way as REC12 but needs more computation. It seems that the compiler cannot handle my current code: NaNs occur. I did not pursue this for now; I think one has to build a standalone test to get a working implementation. The same will certainly be true for REC8.

  • % Done changed from 40 to 80
  • File optimal_test_120904 added
  • Status changed from New to In Progress
