Merged spinor kernels broken (Defect #416)


Added by Christopher Pinke almost 5 years ago. Updated almost 3 years ago.


Status:         Feedback
Priority:       Normal
Assignee:       Matthias Bach
Category:       -
Target version: -
Start date:     27 Jan 2013
Due date:       -
% Done:         100%

Description

Currently, there is one merged spinor kernel: saxpy + squarenorm.
The corresponding tests all fail.
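
For reference, the idea of the fused operation is roughly the following. This is only a minimal sketch over a flat double array standing in for the even-odd spinorfield, not the actual CL2QCD kernel; the name, signature and helper layout are made up:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// Sketch: out = x + alpha*y, plus a per-work-group partial sum of |out|^2,
// computed in the same pass so the result is reused while still in registers.
// Requires a power-of-two work-group size; one final reduction over
// partial_sums (one entry per group) is still needed on host or device.
__kernel void saxpy_and_squarenorm_sketch(__global const double * x,
                                          __global const double * y,
                                          const double alpha,
                                          __global double * out,
                                          __global double * partial_sums,
                                          __local double * scratch,
                                          const unsigned int N)
{
    double sum = 0.;
    for (unsigned int i = get_global_id(0); i < N; i += get_global_size(0)) {
        const double tmp = x[i] + alpha * y[i];  // saxpy
        out[i] = tmp;
        sum += tmp * tmp;                        // squarenorm contribution
    }
    // tree reduction in local memory, one partial result per work group
    const unsigned int lid = get_local_id(0);
    scratch[lid] = sum;
    for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (lid < s)
            scratch[lid] += scratch[lid + s];
    }
    if (lid == 0)
        partial_sums[get_group_id(0)] = scratch[0];
}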


Related issues

Follows CL2QCD - Unit Test #323: Merged fermionmatrix kernels including dslash broken? (New, start: 25 Jan 2013, due: 25 Jan 2013)

History

Updated by Christopher Pinke almost 3 years ago

As of now, the test passes on the GPU in gpu-dev03.

I also implemented the use of the merged kernel in the single-device CG, and it gives the same result as with the unmerged kernels.
Here, I used the setup given in #717.

Note that, strangely, the merged kernels do not increase the speed of the inverter!
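
For context, in the textbook CG iteration (the CL2QCD formulation may differ in details) the fused operation corresponds to the residual update followed directly by its squared norm:

$$r_{k+1} = r_k - \alpha_k A p_k, \qquad \beta_k = \frac{\langle r_{k+1}, r_{k+1} \rangle}{\langle r_k, r_k \rangle},$$

so the norm of the freshly written spinorfield is needed right after the saxpy, which is what makes the fusion a candidate in the first place.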

  • Status changed from New to Done
  • % Done changed from 0 to 100

Updated by Christopher Pinke almost 3 years ago

Matthias, do you have any idea why merging the two kernels does not give a speedup? Actually, I get 10 GFLOPS less when using the merged kernels in one specific setup.

Naively, the merged kernel has to read and write one whole spinorfield less. Could it just be that kernel execution is shadowed better when there are more kernels?
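
To make the naive count explicit (counting only spinorfield traffic, with $|\phi|$ the size of one even-odd spinorfield; the reduction output is negligible):

$$\text{saxpy} + \text{squarenorm}: \; (2R + 1W) + 1R = 4\,|\phi|, \qquad \text{merged}: \; 2R + 1W = 3\,|\phi|,$$

i.e. the fused kernel should move roughly 25% less data.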

  • Status changed from Done to Feedback
  • Assignee changed from Christopher Pinke to Matthias Bach

Updated by Matthias Bach almost 3 years ago

Did you check for register spilling? The force kernels have been split six ways to avoid register spilling, which more than doubled the net bandwidth utilization (even though the total bandwidth consumption rises and prevents the net utilization from reaching the peak).

Updated by Christopher Pinke almost 3 years ago

There is actually no register spilling:

[09:49:19] TRACE: Reading information from file saxpy_AND_squarenorm_eo_Tahiti.isa
[09:49:19] DEBUG: Kernel: saxpy_AND_squarenorm_eo - 34 sGPRs, 118 vGPRs, 0 scratch registers, 32768 bytes statically allocated local memory
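
As an aside, the same figures can be cross-checked at runtime in a vendor-independent way with clGetKernelWorkGroupInfo. A minimal sketch; kernel and device are assumed to exist already, and on the AMD runtime CL_KERNEL_PRIVATE_MEM_SIZE is, to my knowledge, where spilled/scratch memory would show up:

#include <CL/cl.h>
#include <stdio.h>

/* Sketch: query per-kernel local and private memory usage.
 * 'kernel' and 'device' have to be created elsewhere. */
static void print_kernel_mem_usage(cl_kernel kernel, cl_device_id device)
{
    cl_ulong local_mem = 0, private_mem = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_mem), &local_mem, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_mem), &private_mem, NULL);
    printf("local mem: %llu B, private mem: %llu B\n",
           (unsigned long long) local_mem, (unsigned long long) private_mem);
}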

To give some numbers, performing 2000 iterations of the CG (with the setup from #717) with kernel merging:

[10:22:01] INFO:     SOLVER [CG] [002000]:    CG completed in 13552 ms @ 59.544 Gflops. Performed 2000 iterations. Performance after warmup: 63.347 Gflops.

and without:
[10:22:51] INFO:     SOLVER [CG] [002000]:    CG completed in 12091 ms @ 66.736 Gflops. Performed 2000 iterations. Performance after warmup: 70.237 Gflops.

Hence, without merging the solver is 10% faster!

The performance of the merged kernel is not bad, it achieves 123 GB/s, but the original kernels are simply better (in the CG one actually uses the scalar product instead of the squarenorm!):

#device 0                   Time [mus]  Calls  Avg Time [mus]  Avg Time/Site [mus]  BW [GB/s]         FLOPS [GFLOP/s]   Re/Wr [MB]  FLOP
saxpy_eoprec                969418      4022   241             0                    135.814584734346  22.0239867136777  31.21875    5308416
scalar_product_eoprec       220162      2022   108             0                    203.138619743643  48.753250370182   21.09375    5308414
saxpy_AND_squarenorm_eo     552178      2001   275             0                    123.436164309335  29.4563501443375  32.484375   8128510

What is completely puzzling is that the average time of the merged kernel is smaller than the sum of the two original ones, which means it should actually be faster! From the numbers above, ~2000 calls of saxpy + scalar_product take ~700k mus in total, while the merged kernel only takes ~550k mus. This is a ~25% effect.

Updated by Christopher Pinke almost 3 years ago

Matthias, do you have a suggestion for how to investigate this behaviour? Perhaps profiling?

Updated by Matthias Bach almost 3 years ago

This might be related to the local memory usage required for the reduction. This probably causes fewer groups to be scheduled concurrently and might lower the effectiveness of the saxpy, which essentially does little more than a memcpy. It might be worth playing around with group counts and sizes to vary the occupancy. Clark has a very interesting paper on auto-varying those parameters at runtime to find the optimum, similar to how it's done in caldgemm. This is just a guess, though; I would have to think about it more deeply to give a well-founded answer.
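
To put a rough number on the local memory argument (an estimate, assuming the 64 KiB of LDS per compute unit of Tahiti and the 32768 bytes of statically allocated local memory reported above):

$$\lfloor 65536\ \mathrm{B} \,/\, 32768\ \mathrm{B\ per\ group} \rfloor = 2 \ \text{work groups per CU},$$

so, if that estimate is right, the reduction's local memory alone caps the occupancy at two groups per compute unit, independent of register usage.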

Updated by Christopher Pinke almost 3 years ago

I tried to set the sizes of the work groups manually using attributes. This works very well; apparently one should not use 128 but 64 as the work group size. This gives a 50% increase in the gamma5 and saxpy kernels (eo)!
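
For reference, the standard way to pin the work-group size at compile time is the reqd_work_group_size kernel attribute. A sketch with a placeholder kernel, not the actual CL2QCD code:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// The attribute tells the compiler the exact work-group size, so it can
// optimise register allocation for it, and makes enqueues with any other
// local size fail.
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void saxpy_sketch(__global const double * x,
                           __global const double * y,
                           const double alpha,
                           __global double * out,
                           const unsigned int N)
{
    for (unsigned int i = get_global_id(0); i < N; i += get_global_size(0))
        out[i] = x[i] + alpha * y[i];
}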

A typical profile looks like this (for the setup from above):

#device 0                   Time [mus]  Calls  Avg Time [mus]  Avg Time/Site [mus]  BW [GB/s]           FLOPS [GFLOP/s]     Re/Wr [MB]          FLOP
gamma5_eo                   138355      1016   136             0                    155.927885685375    4.87274642766796    20.25               663552
dslash_eo                   1655180     2032   814             0                    195.508065201368    110.787903614108    151.875             90243072
saxpy_eoprec                277045      1508   183             0                    178.183074431951    28.8945526105867    31.21875            5308416
scalar_product_reduction    94530       1009   93              0                    0.0220308473500476  196897966469617     0.0019683837890625  18446744073709551615
scalar_product_eoprec       83686       1008   83              0                    266.416690963841    63.9399817412709    21.09375            5308414

What one sees very nicely is that all relevant kernels now achieve at least 150 GB/s!

Now what I do not understand: if one adds up all the times from the profile, one gets roughly 2.3 s. However, the solver (without profiling) needs ~3.4 s, which is almost 50% more! One can also see this by calculating the effective BW achieved in the solver: it is ~110 GB/s. Shouldn't it be at least 150 GB/s? Where does the solver lose time?
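
As a rough cross-check (assuming the Re/Wr column of the profile is per call; whether MB means 10^6 or 2^20 bytes does not change the picture), the total traffic of the profiled kernels is

$$1016 \cdot 20.25 + 2032 \cdot 151.9 + 1508 \cdot 31.2 + 1008 \cdot 21.1 \;\approx\; 4 \cdot 10^5\ \mathrm{MB} \;\approx\; 0.4\ \mathrm{TB},$$

which corresponds to about 0.4 TB / 3.4 s ≈ 115 GB/s against the solver wall time, but about 0.4 TB / 2.3 s ≈ 170 GB/s against the summed kernel times.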

If I do 2000 iterations (arbitrarily), the profiling data sums up to ~9s, while the solver needs 11.5s (30% more).
If I increase to 10000 iterations, the kernels alone sum up to ~45s while the solver needs 55s (22% more).

Clearly, the relative time loss shrinks as the number of iterations grows, perhaps because the overhead is amortized over the larger total workload?

Updated by Christopher Pinke almost 3 years ago

Matthias, do you have an idea what the origin of these observations is?
Of course, I know that one cannot simply add up the profiling data and compare the time to the "real" case, but I guess that a 50% loss is a bit drastic.

Could it just be the startup time?
I know that you had something similar while you were investigating the program with the profiler.

But if it is, I do not understand why the merged kernel slows the solver down; it is one kernel fewer (and it performs better than the two unmerged ones!).

Updated by Christopher Pinke almost 3 years ago

Actually, I think the point here could be that looking at the FLOPS may be misleading; the bandwidth usage should be a better quantity.
It may be that I made some mistakes in getting the BW numbers above.
I will implement this properly (see #727) and then we can see.
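
For the record, a minimal sketch of how per-kernel bandwidth can be derived from OpenCL event profiling (the command queue must be created with CL_QUEUE_PROFILING_ENABLE, and bytes_per_call has to come from the kernel's known read/write footprint; this is not the existing CL2QCD profiling code):

#include <CL/cl.h>

/* Sketch: effective bandwidth (GB/s) of one kernel invocation, from the
 * event returned by clEnqueueNDRangeKernel. */
static double kernel_bandwidth_gbs(cl_event event, double bytes_per_call)
{
    cl_ulong start = 0, end = 0;
    clWaitForEvents(1, &event);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    const double seconds = (double) (end - start) * 1e-9;  /* ns -> s */
    return bytes_per_call / seconds * 1e-9;                /* B/s -> GB/s */
}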
