# Fix local reductions (Defect #462)

**Description**

The reduction on the local groups are currently unsafe.

This is because it is implicitely assumed that the local size is <= then 128.

If this is not the case, the first piece, where the idea is to get rid of all threads higher then 64, may have a race-condition:

if (idx >= 64) { result_local[idx % 64].re += result_local[idx].re; result_local[idx % 64].im += result_local[idx].im; }

Ie if ls = 256, threads 128 and 192 want to add their result to that one of thread 0.

Apparently, it depends on the order of the thread execution if there is a race-condition, but this is very dangerous.

The same problem holds for the next steps in the reduction:

if (idx >= 32) { result_local[idx - 32].re += result_local[idx].re; result_local[idx - 32].im += result_local[idx].im; }

The aim here is that thread 32-63 add their results to those of threads 0-31, which they do.

However, all the other threads perform that task too, altough they should not. Ie threads 64-96 add their results to threads 32-64, one has a race-condition again. Therefore, this step should have

if (idx >= 32 && idx <64) {

to be safe.

The other pieces should be safe if changed accordingly.

As an alternative, I implemented another reduction a long time ago (to be found in the polyakovloop kernel).

//reduction if(local_size == 1) { ((out))[group_id].re += tmp_pol.re; ((out))[group_id].im += tmp_pol.im; } else { //wait for all threads to end calculations barrier(CLK_LOCAL_MEM_FENCE); //!!CP: this should be checked by someone else than me //perform reduction out_loc[idx].re = tmp_pol.re; out_loc[idx].im = tmp_pol.im; int cut1; int cut2 = local_size; for(cut1 = local_size / 2; cut1 > 0; cut1 /= 2) { for(int i = idx + cut1; i < cut2; i += cut1) { ((out_loc)[idx]).re += ((out_loc)[i]).re; ((out_loc)[idx]).im += ((out_loc)[i]).im; } //!!CP: is this dangerous inside a for-loop? barrier(CLK_LOCAL_MEM_FENCE); cut2 = cut1; } if(idx == 0) { out[group_id].re = out_loc[0].re; out[group_id].im = out_loc[0].im; } }

This should be checked by someone else then me, but I think this works (after all, the polyakovloop works, and I tested this shortly in the inverter, too).

The idea is that the inner for-loop is exited depending on the idx of the thread, ie it is similar to the if with two conditions from above.

But since this implementation depends on the local_size explicitely, one does not have possible race conditions.

For example, suppose one has 256 threads = cut2. In the first step, cut1 = 128, that means threads 128 to 255 exit the second for-loop right away. Their results are fetched by threads 0-127 with a[idx] += a[idx+128]. After that, also threads 0-127 exit the inner for-loop, just because then i = idx + cut1 + cut1 = idx + cut2, which is always >= cut2. The next iteration of the outer loop is similar but with cut2 = 128 and cut1 = 64. Just like in the first iteration, all threads with idx >= 64 do nothing in this round, so there are no race conditions.

The outer loop runs until thread 0 has all results. In terms of performance one could stop at 8 or so and then do the summation explicitely, like it is done in the other reduction.

This could also be important regarding the one drawback of this implementation: The foor loops may not be as performant as direct codings.

However, I think we should implement the second way, since this is safe for all ls.

**Related issues**

related to CL2QCD - Unit Test #465: Implement reduction tests | Done | 23 Apr 2013 | ||

related to CL2QCD - Defect #437: Local reduction unsafe | Done | 08 Mar 2013 |

### Associated revisions

added proper local reduction to sf_squarenorm

refs #462

added proper local reduction to sf_eo_squarenorm

refs #462

added proper local reduction to spinorfield_scalar_product

refs #462

added proper local reduction to spinorfield_eo_scalar_product

refs #462

added proper local reduction to plaquette

refs #462

added proper local reduction to rectangles

refs #462

added proper local reduction to gm squarenorm

refs #462

added proper local reduction to polyakov

refs #462

### History

#### Updated by Christopher Pinke over 6 years ago

What do you think?

**Status**changed from*In Progress*to*Feedback*

#### Updated by Alessandro Sciarra over 6 years ago

I widely checked the code from gaugeobservables_polyakov.cl and I wrote my considerations in the enclosed file. The main result is that this method is reliable if and only if **local_size is a power of 2**.

**File**On_Reduction.pdf added

#### Updated by Christopher Pinke over 6 years ago

Alessandro, thank you very much for these considerations, you have a very good point here. Of course, I only thought of numbers which are a multiple of 2 when thinking about the code (this justifies line 9 in your writeup). This will be the case in most of the applications, however, it need not be.

I came up with a solution, I think it should work.

//reduction if(local_size == 1) { ((out))[group_id].re += tmp_pol.re; ((out))[group_id].im += tmp_pol.im; } else { //wait for all threads to end calculations barrier(CLK_LOCAL_MEM_FENCE); //!!CP: this should be checked by someone else than me //perform reduction out_loc[idx].re = tmp_pol.re; out_loc[idx].im = tmp_pol.im; int factor = 1; int cut0 = local_size; for(;;cut0 /= 2){ if(cut0 % 2 != 0) break; } cut0 *= factor; if( cut0 > local_size) cut0 /= factor; int cut1; int cut2 = local_size; for(cut1 = local_size / 2; cut1 > cut0; cut1 /= 2) { for(int i = idx + cut1; i < cut2; i += cut1) { ((out_loc)[idx]).re += ((out_loc)[i]).re; ((out_loc)[idx]).im += ((out_loc)[i]).im; } //!!CP: is this dangerous inside a for-loop? barrier(CLK_LOCAL_MEM_FENCE); cut2 = cut1; } if(idx == 0) { out[group_id].re = out_loc[0].re; out[group_id].im = out_loc[0].im; for(int i = 1; i < cut0; i++){ out[group_id].re += out_loc[i].re; out[group_id].im += out_loc[i].im; } }

Here, I introduced a third cut, cut0. The idea is that this is ls without powers of 2, in the example of ls=6 it should be 3, whereas for ls = 8, it is 1. Then, the outer loop exists if cut1 is smaller then cut0, ie for ls=8 nothing changes, and for ls=3 the race-condition should be avoided.

In the end, thread 0 collects all cut0 remaining results. Again, for ls=8 this should be the same as before, for ls=3 thread 0 should collect 3 results.

Since one could think of breaking the for-loops at some point anyway, for example if only 8 threads (out of 128) are left and let these results be collected by one thread (this is exactly what thread 0 does here), I introduced the "factor", which effectively reduces the number of iterations of the outer loop.

Please check this, but I think it should work.

Of course one has to see if the introduced comp. overhead here is useful or if one should just exclude local sizes which are not powers of 2 by hand.

#### Updated by Matthias Bach over 6 years ago

I would personally not add any additional complications to a kernel just to make it possible to use it with every possible local size. It is completely valid to specify a kernel to only be usable with certain local sizes, just as you also specify that the output buffer has to be the same size as there are numbers of groups.

#### Updated by Christopher Pinke over 6 years ago

I replaced all local reductions (except in staggered kernels) by a form similar to the one above.

The difference is that I sum the last 8 results in a hardcoded fashion.

I added a comment to the kernels that they should only be used with ls a power of 2 and bigger than 8.

**% Done**changed from*0*to*100*