09 Dec 2010 02:17 pm
This technical report describes the inner workings of CALDGEMM. It explains in detail how CALDGEMM manages to reach peak performance on current AMD GPUs and how Linpack was adjusted to make full use of this library.
18 Apr 2011 03:34 pm
Abstract: The installation of the LOEWE-CSC supercomputer at the Goethe University in Frankfurt led to the development of a Linpack implementation that can fully utilize the installed AMD Cypress GPUs. At its core is a fast DGEMM that uses the GPU and the CPU together. The DGEMM library is tuned to hide all DMA transfer times and thus maximize the GPU load. A work-stealing scheduler adds the remaining CPU resources to the DGEMM. On the GPU alone, the DGEMM achieves 497 GFlop/s (90.9% of the theoretical peak). Combined with the 24-core Magny-Cours CPUs, it achieves 623 GFlop/s (83.6% of the peak).
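
The efficiency figures in the abstract can be cross-checked with simple arithmetic. The following is a minimal sketch that only inverts the quoted percentages to recover the peak each one implies. The function name `implied_peak` is a hypothetical helper introduced for illustration, and the abstract does not state what the combined GPU+CPU peak consists of, so the second result is simply whatever the quoted 83.6% implies.

```python
def implied_peak(achieved_gflops: float, efficiency_percent: float) -> float:
    """Return the peak rate (GFlop/s) implied by an achieved rate and its efficiency."""
    return achieved_gflops / (efficiency_percent / 100.0)


# Figures quoted in the abstract.
gpu_peak = implied_peak(497.0, 90.9)    # GPU-only DGEMM
node_peak = implied_peak(623.0, 83.6)   # GPU combined with the Magny-Cours CPUs

print(f"Implied GPU peak:      {gpu_peak:.1f} GFlop/s")
print(f"Implied combined peak: {node_peak:.1f} GFlop/s")
```

Because the quoted percentages are rounded to one decimal place, the implied peaks are accurate only to within a few tenths of a GFlop/s.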