HMC hangs for large lattices on Tahiti (Defect #366)


Added by Matthias Bach over 6 years ago. Updated over 6 years ago.


Status: In Progress   Start date: 21 Nov 2012
Priority: High   Due date:
Assignee: Matthias Bach   % Done: 50%

Category: -
Target version: -

Description

When using sufficiently large lattices the HMC will get stuck on Tahiti.

Sufficiently large lattices are 32^3x16 and up.
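To put these sizes in perspective, here is a rough sketch of the site count and gauge field footprint for a given lattice. The function names are hypothetical, and the storage layout (4 links per site, each a full 3x3 complex matrix in double precision, i.e. 144 bytes) is an assumption, not something taken from the application's code.

```python
def lattice_sites(ns, nt):
    """Number of sites of an ns^3 x nt lattice."""
    return ns ** 3 * nt


def gauge_field_bytes(ns, nt, bytes_per_link=144):
    """Rough gauge field size in bytes.

    Assumption: 4 links per site, each stored as a full 3x3 complex
    matrix in double precision (18 doubles = 144 bytes).
    """
    links = 4 * lattice_sites(ns, nt)
    return links * bytes_per_link


for ns, nt in [(32, 16), (48, 8)]:
    mib = gauge_field_bytes(ns, nt) / 2 ** 20
    print(f"{ns}^3x{nt}: {lattice_sites(ns, nt)} sites, about {mib:.0f} MiB")
```

Under these assumptions a 32^3x16 lattice has 524288 sites, and a 48^3x8 lattice has 884736 sites, so the latter is the larger of the two even though its time extent is shorter.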

The hang seems to occur in the kernel $gauge_force_tlsym$.


History

Updated by Matthias Bach over 6 years ago

The problem cannot be reproduced when running the kernel in the standalone test.

Updated by Matthias Bach over 6 years ago

The problem cannot be solved by removing the compile-time work group size definition.

Updated by Matthias Bach over 6 years ago

It seems that when running on only even or only odd sites, the kernel does not hang.

Updated by Matthias Bach over 6 years ago

One can split even and odd sites onto different threads and the hang will go away.
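As an illustration of the split, the following sketch partitions the sites of a lattice by checkerboard parity. It assumes the usual convention that a site is even when the sum of its coordinates is even; the function names are hypothetical, and the actual kernel indexing may differ.

```python
from itertools import product


def site_parity(x, y, z, t):
    """Return 0 for an even site and 1 for an odd site (assumed convention)."""
    return (x + y + z + t) % 2


def split_even_odd(ns, nt):
    """Partition all sites of an ns^3 x nt lattice into even and odd lists."""
    even, odd = [], []
    for site in product(range(ns), range(ns), range(ns), range(nt)):
        if site_parity(*site) == 0:
            even.append(site)
        else:
            odd.append(site)
    return even, odd


even, odd = split_even_odd(4, 2)
# Each half could then be processed separately instead of in one pass.
print(len(even), len(odd))
```

Each half covers exactly half of the sites whenever at least one lattice extent is even, so the two passes do equal amounts of work.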

Updated by Matthias Bach over 6 years ago

The application will also hang on the $gaugefield_zero$ kernel.

Replacing it with $clEnqueueFillBuffer$ avoids that, but the call was only introduced in OpenCL 1.2, so one loses compatibility with OpenCL 1.0 and 1.1.

Updated by Matthias Bach over 6 years ago

Major drawback of the solution so far: if the kernels are not in lockstep, the $gauge_force_tlsym$ kernel produces a NaN.

Updated by Matthias Bach over 6 years ago

I found the NaN, a bug introduced during the fix.

It seems the hangs can be circumvented now. Performance has to be rechecked, though.

  • % Done changed from 0 to 50

Updated by Matthias Bach over 6 years ago

For lattices of size 48^3x8 the hang still occurs... :(

Updated by Matthias Bach over 6 years ago

  • Status changed from New to In Progress
  • Priority changed from Normal to High
  • Target version deleted (2012.12)
