# Merged spinor kernels broken (Defect #416)

**Description**

Currently, there is one merged spinor kernel: saxpy + squarenorm.

The corresponding tests all fail.

**Related issues**

follows CL2QCD - Unit Test #323: Merged fermionmatrix kernels including dslash broken? | New | 25 Jan 2013 | 25 Jan 2013 |

### History

#### Updated by Christopher Pinke over 4 years ago

As of now, the test passes on the GPU in gpu-dev03.

I also implemented the use of the merged kernel in the single-device CG, and it gives the same result as with the unmerged kernels.

Here, I used the setup given in #717.

Note that, strangely, the merged kernels do not increase the speed of the inverter!

**Status** changed from *New* to *Done*; **% Done** changed from *0* to *100*

#### Updated by Christopher Pinke over 4 years ago

Matthias, do you have any idea why merging the two kernels does not give a speedup? Actually, I get 10 GFLOPS less when using the merged kernels in one specific setup.

Naively, one has to read and write one whole spinorfield less. Could it just be the shadowing of kernel execution with more kernels?
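To make the naive expectation concrete, here is a minimal CPU reference sketch in plain Python (not the OpenCL kernels themselves). It assumes that saxpy computes `out = y - alpha * x`; the sign convention in the actual code base may differ, and all function names are illustrative only.

```python
# Reference sketch of the unmerged and merged variants on a tiny "field".
# Assumption: saxpy computes out = y - alpha * x (illustrative convention).

def saxpy(alpha, x, y):
    """Unmerged step 1: reads x and y, writes out."""
    return [yi - alpha * xi for xi, yi in zip(x, y)]

def squarenorm(field):
    """Unmerged step 2: reads the field again and reduces it to a scalar."""
    return sum(abs(v) ** 2 for v in field)

def saxpy_and_squarenorm(alpha, x, y):
    """Merged variant: the norm is accumulated while out is being produced."""
    out = []
    norm = 0.0
    for xi, yi in zip(x, y):
        v = yi - alpha * xi
        out.append(v)
        norm += abs(v) ** 2
    return out, norm

alpha = 0.5 + 0.25j
x = [1 + 2j, -3 + 0.5j, 0.25 - 1j]
y = [2 - 1j, 0.5 + 0.5j, -1 + 4j]

unmerged_out = saxpy(alpha, x, y)
unmerged_norm = squarenorm(unmerged_out)
merged_out, merged_norm = saxpy_and_squarenorm(alpha, x, y)

# Field-sized memory passes in this simple count:
# saxpy reads x, y and writes out (3); squarenorm re-reads out (1).
passes_unmerged = 3 + 1
passes_merged = 3
```

In this count the merged variant saves one field-sized pass over memory, which is why a speedup would be expected if both variants were purely bandwidth-bound.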

**Status** changed from *Done* to *Feedback*; **Assignee** changed from *Christopher Pinke* to *Matthias Bach*

#### Updated by Matthias Bach over 4 years ago

Did you check for register spilling? The force kernels were split six ways to avoid register spilling, which more than doubled the net bandwidth utilization (even though the total bandwidth consumption rises, which prevents the net utilization from reaching the peak).

#### Updated by Christopher Pinke over 4 years ago

There is actually no register spilling:

```
[09:49:19] TRACE: Reading information from file saxpy_AND_squarenorm_eo_Tahiti.isa
[09:49:19] DEBUG: Kernel: saxpy_AND_squarenorm_eo - 34 sGPRs, 118 vGPRs, 0 scratch registers, 32768 bytes statically allocated local memory
```

To give some numbers, performing 2000 iterations of the CG (with the setup from #717) with kernel merging:

```
[10:22:01] INFO: SOLVER [CG] [002000]: CG completed in 13552 ms @ 59.544 Gflops. Performed 2000 iterations. Performance after warmup: 63.347 Gflops.
```

and without:

```
[10:22:51] INFO: SOLVER [CG] [002000]: CG completed in 12091 ms @ 66.736 Gflops. Performed 2000 iterations. Performance after warmup: 70.237 Gflops.
```

Hence, without merging the solver is 10% faster!

The performance of the merged kernel is not bad: it reaches 123 GB/s. The original kernels are simply better, though (in the CG, one actually uses the scalar product instead of the squarenorm!):

```
saxpy_eoprec 969418 4022 241 0 135.814584734346 22.0239867136777 31.21875 5308416
scalar_product_eoprec 220162 2022 108 0 203.138619743643 48.753250370182 21.09375 5308414
saxpy_AND_squarenorm_eo 552178 2001 275 0 123.436164309335 29.4563501443375 32.484375 8128510
```

What is completely puzzling is that the avg. time of the merged kernel is smaller than the sum of the avg. times of the two original ones. This means that it should actually be faster! From the numbers above, ~2000 calls of saxpy + scalar_product take ~700k mus, while the merged kernel takes only ~550k mus. This is a ~25% effect.
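The comparison can be reproduced directly from the three profile rows. The sketch below is only arithmetic on the numbers quoted in this comment; it assumes that "MB" in the Re/Wr column means 2**20 bytes and that "GB/s" means 1e9 bytes per second, which is what makes the recomputed bandwidths agree with the BW column.

```python
# Rows copied from the profile: (total time [mus], calls, Re/Wr per call [MB])
profile = {
    "saxpy_eoprec": (969418, 4022, 31.21875),
    "scalar_product_eoprec": (220162, 2022, 21.09375),
    "saxpy_AND_squarenorm_eo": (552178, 2001, 32.484375),
}

def avg_time_mus(name):
    """Average time per kernel call in microseconds."""
    total_mus, calls, _ = profile[name]
    return total_mus / calls

def bandwidth_gb_per_s(name):
    """Effective bandwidth; assumes MB = 2**20 bytes and GB = 1e9 bytes."""
    total_mus, calls, mb_per_call = profile[name]
    bytes_moved = mb_per_call * 2**20 * calls
    return bytes_moved / (total_mus * 1e-6) / 1e9

iterations = 2000
unmerged_mus = iterations * (
    avg_time_mus("saxpy_eoprec") + avg_time_mus("scalar_product_eoprec")
)
merged_mus = iterations * avg_time_mus("saxpy_AND_squarenorm_eo")

# Fraction of kernel time saved by merging, relative to the unmerged pair
saving = 1 - merged_mus / unmerged_mus

for name in profile:
    print(f"{name}: {avg_time_mus(name):.1f} mus/call, "
          f"{bandwidth_gb_per_s(name):.1f} GB/s")
print(f"unmerged: {unmerged_mus:.0f} mus, merged: {merged_mus:.0f} mus, "
      f"saving: {saving:.1%}")
```

Depending on the reference, the effect is about 21% less time for the merged kernel, or about 27% more time for the unmerged pair, which brackets the ~25% quoted above.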

#### Updated by Christopher Pinke over 4 years ago

Matthias, do you have a suggestion how to investigate this behaviour? Perhaps profiling?

#### Updated by Matthias Bach over 4 years ago

This might be related to the local memory usage required for the reduction. This probably causes fewer groups to be scheduled concurrently and might lower the effectiveness of the saxpy, which essentially does little more than a memcpy. It might be worth playing around with group counts and sizes to vary the occupancy. Clark had a very interesting paper on auto-varying those parameters at runtime to find the optimum, similar to how it's done in caldgemm. This is just a guess, though; I would have to think about it more deeply to give a well-founded answer.
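A rough way to check this guess is to estimate how many work-groups fit on one compute unit, using the compiler output quoted earlier in this thread (32768 bytes of local memory, 118 vGPRs). The hardware limits below are assumed figures for a GCN (Tahiti) compute unit and are not stated anywhere in this thread; the work-group size of 128 is likewise only an example value, and register allocation granularity is ignored.

```python
# Rough occupancy estimate for one compute unit (CU).
# Assumed GCN (Tahiti) limits, NOT taken from this thread:
LDS_BYTES_PER_CU = 64 * 1024   # local memory (LDS) per CU
VGPRS_PER_SIMD_LANE = 256      # vector registers available per lane
SIMDS_PER_CU = 4
WAVEFRONT_SIZE = 64
MAX_WAVES_PER_SIMD = 10

def groups_per_cu(local_mem_bytes, vgprs, group_size):
    """Return (resident groups, limit from local memory, limit from vGPRs)."""
    waves_per_group = -(-group_size // WAVEFRONT_SIZE)  # ceiling division
    if local_mem_bytes > 0:
        lds_limit = LDS_BYTES_PER_CU // local_mem_bytes
    else:
        lds_limit = float("inf")  # no local memory: no limit from LDS
    waves_per_simd = min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD_LANE // vgprs)
    vgpr_limit = (waves_per_simd * SIMDS_PER_CU) // waves_per_group
    return min(lds_limit, vgpr_limit), lds_limit, vgpr_limit

# Values reported for saxpy_AND_squarenorm_eo; group size 128 is an assumption
resident, by_lds, by_vgprs = groups_per_cu(32768, 118, 128)
print(f"resident groups per CU: {resident} "
      f"(LDS limit {by_lds}, vGPR limit {by_vgprs})")
```

Under these assumptions the local memory allocation, not the register count, is the tighter limit, which would be consistent with the guess above. It is only an estimate and no substitute for varying the group size and measuring.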

#### Updated by Christopher Pinke over 4 years ago

I tried to set the sizes of the workgroups manually using *attributes*. This works very well; apparently one should use 64 rather than 128 as the workgroup size. This gives a 50% increase for the gamma5 and saxpy kernels (eo)!

A typical profile looks like this (for the setup from above):

```
#device 0 Time [mus] Calls Avg Time [mus] Avg Time/Site [mus] BW [GB/s] FLOPS [GFLOP/s] Re/Wr [MB] FLOP
gamma5_eo 138355 1016 136 0 155.927885685375 4.87274642766796 20.25 663552
dslash_eo 1655180 2032 814 0 195.508065201368 110.787903614108 151.875 90243072
saxpy_eoprec 277045 1508 183 0 178.183074431951 28.8945526105867 31.21875 5308416
scalar_product_reduction 94530 1009 93 0 0.0220308473500476 196897966469617 0.0019683837890625 18446744073709551615
scalar_product_eoprec 83686 1008 83 0 266.416690963841 63.9399817412709 21.09375 5308414
```

What one sees very nicely is that all relevant kernels make at least 150 GB/s now!

Now what I do not understand: if one adds up all the times from the profile, one gets roughly 2.3 s. However, the solver (without profiling) needs ~3.4 s, which is almost 50% more! One can also see this by calculating the effective BW achieved in the solver: it is ~110 GB/s. Shouldn't it be at least 150 GB/s? Where does the solver lose time?

If I do 2000 iterations (arbitrarily), the profiling data sums up to ~9s, while the solver needs 11.5s (30% more).

If I increase to 10000 iterations, the kernels alone sum up to ~45s while the solver needs 55s (22% more).

Clearly, increasing the number of iterations seems to reduce the relative time loss, perhaps because of the larger total workload?
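The gap between summed kernel time and solver time can be put side by side for the three cases. The sketch below only redoes the arithmetic on the numbers quoted in this comment (the kernel totals from the profile and the rounded timings for 2000 and 10000 iterations); the run labels are illustrative.

```python
# Total kernel times [mus] copied from the profile
kernel_times_mus = {
    "gamma5_eo": 138355,
    "dslash_eo": 1655180,
    "saxpy_eoprec": 277045,
    "scalar_product_reduction": 94530,
    "scalar_product_eoprec": 83686,
}

def overhead(kernel_s, solver_s):
    """Extra solver wall-clock time relative to the summed kernel time."""
    return solver_s / kernel_s - 1.0

profile_sum_s = sum(kernel_times_mus.values()) * 1e-6  # about 2.25 s

# (summed kernel time [s], solver time [s]) as quoted in the text
runs = {
    "profiled setup": (profile_sum_s, 3.4),
    "2000 iterations": (9.0, 11.5),
    "10000 iterations": (45.0, 55.0),
}

for label, (kernel_s, solver_s) in runs.items():
    print(f"{label}: {overhead(kernel_s, solver_s):.0%} overhead, "
          f"{solver_s - kernel_s:.2f} s spent outside the kernels")
```

If the rounded timings are taken at face value, the relative overhead shrinks with the iteration count while the absolute gap still grows (roughly 1.2 s, 2.5 s and 10 s). That would suggest a combination of a one-time cost and a per-iteration cost rather than startup time alone, but the inputs are too coarse to separate the two reliably.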

#### Updated by Christopher Pinke over 4 years ago

Matthias, do you have an idea what is the origin of these observations?

Of course, I know that one cannot simply add up the profiling data and compare the time to the "real" case, but I guess that a 50% loss is a bit drastic.

Could it just be the startup time?

I know that you had something similar while you were investigating the program with the profiler.

But if it is, I do not understand why the merged kernel slows the solver down: it is one kernel fewer (and it performs better than the two unmerged ones!).

#### Updated by Christopher Pinke over 4 years ago

Actually, I think the point here could be that looking at the Flops may be misleading; the bandwidth usage should be a better quantity.

It may be that I made some mistakes when getting the BW numbers above.

I will implement this properly (see #727) and then we can see.