ANNOUNCEMENT

This wiki is no longer maintained because the repository has moved to https://github.com/davidrohr/hpl-gpu. The new wiki can be found in that repository.

The old wiki is provided below for reference.

Wiki

This version of Linpack differs from the HPL published by netlib.org in several aspects, including build requirements and configuration, run configuration, and the license. All of these are covered in this wiki. Many technical details regarding the modifications can be found in the Technical Report.

Requirements

The following software is required by HPL-GPU:

  1. An MPI library; tested are MVAPICH, MVAPICH2, and OpenMPI (the libraries covered under Running below)
  2. A BLAS library
    • The only tested implementation is GotoBLAS
      • On most CPUs it is also the fastest option.
      • CALDGEMM also requires GotoBLAS, so it is the natural choice. See the CALDGEMM wiki for how to build GotoBLAS properly so that it works with CALDGEMM.
    • Fortran BLAS bindings are untested and might be broken.
  3. CALDGEMM
    • Optional
      • Building without CALDGEMM is untested.
      • Required for using AMD GPUs and some other advanced features.
  4. A C++ compiler
    • pthreads support
  5. Intel(R) Threading Building Blocks
    • Optional (the build process will try to download and install TBB if it is not available in the source tree; as long as the machine building HPL has internet access, you do not have to worry about installing TBB)
      • Required for the improved swap implementation, which, ironically, currently requires an AMD CPU.

Building

You can either download the source from the files section or pull it from the git repository.
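For example, obtaining the source via git could look like this (a sketch; it assumes you use the repository location given in the announcement at the top of this page rather than the old repository):

  git clone https://github.com/davidrohr/hpl-gpu.git
  cd hpl-gpu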

Like the original HPL, HPL-GPU requires a build configuration file called Make.ARCHNAME. In this file you have to set the variable ARCH to ARCHNAME and adjust the paths and, potentially, the names of the BLAS and MPI libraries. The file can be placed either in the top-level directory or in the setup directory. Note that while the file can reside in either place, you must not move it after the initial build, as that would break the symbolic links to the file that are created during the initial build.
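As an illustration, a minimal sketch of such a file is shown below. The variable names follow the layout of the stock netlib HPL Make.ARCH files; the examples shipped in the setup directory (see the next paragraph) are the authoritative reference, and all paths, library names, and compiler choices here are placeholders you have to adapt to your system.

  # Make.ARCHNAME -- hypothetical sketch, not a shipped example
  ARCH         = ARCHNAME
  TOPdir       = $(HOME)/hpl-gpu
  # MPI library (placeholder paths)
  MPdir        = /opt/openmpi
  MPinc        = -I$(MPdir)/include
  MPlib        = -L$(MPdir)/lib -lmpi
  # BLAS library: GotoBLAS, built as described in the CALDGEMM wiki
  LAdir        = $(HOME)/GotoBLAS2
  LAinc        =
  LAlib        = $(LAdir)/libgoto2.a
  # C++ compiler with pthreads support
  CC           = g++
  CCFLAGS      = $(HPL_DEFS) -O3 -pthread
  LINKER       = $(CC)

Assuming HPL-GPU keeps the stock HPL build entry point, the build is then started from the top-level directory with make arch=ARCHNAME.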

In the setup directory you will find some examples that use CALDGEMM and document the additional build configuration options; the file names start with Make.openSUSE112 and Make.LOEWE. Important compile-time options are (a Make fragment illustrating them follows the list):
  • HPL_CALL_CALDGEMM to enable using CALDGEMM
  • HPL_COPY_L for best performance with CALDGEMM
  • HPL_HAVE_PREFETCHW to use the prefetchw instruction of AMD CPUs, which makes some prefetches more efficient
  • HPL_FASTINIT to use a much faster alternative random number generator for matrix initialization
  • HPL_FASTVERIFY to enable result verification when the alternative random number generator is in use
  • HPL_PRINT_INTERMEDIATE to enable progress reporting -- this feature is rather crude, so do not trust the estimates too much
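In the stock HPL Make files such options are passed to the compiler as preprocessor defines, typically collected in the HPL_OPTS variable. The following fragment is a hypothetical sketch assuming HPL-GPU keeps that variable; check the examples in the setup directory for the exact names used there.

  # Enable CALDGEMM and the common performance options (sketch only)
  HPL_OPTS     = -DHPL_CALL_CALDGEMM -DHPL_COPY_L -DHPL_FASTINIT \
                 -DHPL_FASTVERIFY -DHPL_PRINT_INTERMEDIATE
  # On AMD CPUs additionally:
  # HPL_OPTS    += -DHPL_HAVE_PREFETCHW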
You should also choose a proper CALDGEMM configuration for your system. Instructions on how to find one can be found in the CALDGEMM wiki. Not all of the CALDGEMM options can be set in the build configuration file; the remaining options are set on the structure cal_info in the function CALDGEMM_Init of the file source:testing/util/UTIL_cal.cpp (see the sketch after this list). Examples are:
  • cal_info.SlowCPU - set to true if your system only has a slow CPU and/or very few cores
  • cal_info.DstMemory - Resembles -o in dgemm_bench. Set to 'g' to have DGEMM results written to GPU memory or to 'c' to have them written directly to CPU memory.
  • cal_info.ImplicitDriverSync - set to true to get the same effect as -I in dgemm_bench
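As an illustration, setting these options could look roughly as follows inside CALDGEMM_Init. This is a sketch only; the exact field types and surrounding code in UTIL_cal.cpp may differ.

  // In testing/util/UTIL_cal.cpp, inside CALDGEMM_Init() -- sketch, not verbatim code
  cal_info.SlowCPU = false;            // set to true if your system has only a slow CPU and/or very few cores
  cal_info.DstMemory = 'g';            // 'g': DGEMM results go to GPU memory, 'c': written directly to CPU memory (like -o in dgemm_bench)
  cal_info.ImplicitDriverSync = true;  // set to true for the same effect as -I in dgemm_bench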

Running

As with the original HPL, the build will put a binary and a sample configuration into bin/ARCHNAME.

As with the original HPL, the configuration file must be called HPL.dat. Note, however, that the options in the configuration file have changed. Therefore you cannot copy a configuration file from the original HPL but have to create one anew. The sample file gives you a working configuration for a single node. Note that the HPL-GPU process will by default use all cores of a node; therefore, you should run only one process per node.

The default values in HPL.dat are already tuned for good performance. Take a look at the Technical Report if you want to know how they were found. The most important options, which you have to adjust in every case, are N, P, and Q, which define the matrix size and the process grid. You can use the script suggest-configs.py in the tools folder to get suggestions for good values. The script is self-documenting; run suggest-configs.py -h for help.

The other two interesting options are BCAST and LOOKAHEAD. Contrary to what is stated on netlib.org, one of the values 4, 5, or 6 should be used for BCAST. The recommended value is 6, which uses the built-in MPI broadcast method. Depending on your MPI library and cluster size, method 4 or 5 might perform better, though.

The LOOKAHEAD value has a different meaning than in the original HPL. HPL-GPU always looks ahead only one panel, but performs different CPU operations in parallel to the GPU DGEMM depending on the mode. Possible values are:

  • 0 - No Lookahead
  • 1 - Lookahead Mode 1
  • 2 - Lookahead Mode 2

The goal is to keep the GPU computing DGEMM 100% of the time. Higher lookahead modes can mean higher memory bandwidth usage, though. Therefore the highest lookahead mode might not be optimal on every system, especially on systems with few or slow CPU cores. For details on the lookahead modes take a look at the Technical Report, where you can also find hints on the other tuning options.

If you are running on a cluster whose nodes have different configurations and speeds, take a look at the instructions for heterogeneous systems.

You run HPL-GPU either by launching the xhpl binary directly or by launching it via MPI. Again, keep in mind to launch only one process per node. As HPL-GPU, or to be exact CALDGEMM, performs its own pinning of threads to CPU cores, you should make sure MPI does not interfere with that. In addition, there can be issues with using RDMA in parallel to GPU DMA. To avoid these problems, setting the following environment variables is recommended (a launch sketch follows the list):

  • MVAPICH - VIADEV_USE_AFFINITY=0 VIADEV_RNDV_PROTOCOL=R3
  • MVAPICH2 - MV2_ENABLE_AFFINITY=0 MV2_RNDV_PROTOCOL=R3 MV2_USE_RDMA_ONE_SIDED=0
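For example, launching one process per node on four nodes of an MVAPICH2 cluster could look like the following. This is a sketch assuming the Hydra mpiexec launcher and a host file named hosts; adapt the launcher, node count, and hostfile handling to your cluster.

  # Run from bin/ARCHNAME, one rank per node, with the recommended MVAPICH2 settings
  mpiexec -n 4 -ppn 1 -f hosts \
    -genv MV2_ENABLE_AFFINITY 0 \
    -genv MV2_RNDV_PROTOCOL R3 \
    -genv MV2_USE_RDMA_ONE_SIDED 0 \
    ./xhpl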

OpenMPI does not show any problems with thread pinning.

License

HPL-GPU is made up of parts that are licensed under the GNU General Public License Version 3 and parts that are licensed under the 4-clause BSD license. The license of each source file is noted in the header of the file. The parts licensed under the GNU General Public License Version 3 grant the following special exception:

"Use with the Original BSD License."

Notwithstanding any other provision of the GNU General Public License Version 3, you have permission to link or combine any covered work with a work licensed under the 4-clause BSD license into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the special requirements of the 4-clause BSD license, clause 3, concerning the requirement of acknowledgement in advertising materials will apply to the combination as such.