Fix local reductions (Defect #462)


Added by Christopher Pinke over 6 years ago. Updated over 6 years ago.


Status: Feedback
Priority: Normal
Assignee: Matthias Bach
Category: -
Target version: -
Start date: 23 Apr 2013
Due date: -
% Done: 100%

Description

The reductions on the local groups are currently unsafe.
This is because it is implicitly assumed that the local size is <= 128.
If this is not the case, the first piece, where the idea is to fold away all threads with idx >= 64, may have a race condition:

        if (idx >= 64) {
            result_local[idx % 64].re += result_local[idx].re;
            result_local[idx % 64].im += result_local[idx].im;
        }

I.e. if ls = 256, threads 64, 128 and 192 all want to add their result to that of thread 0.
Whether a race condition actually occurs depends on the order of thread execution, which makes this very dangerous.

The same problem holds for the next steps in the reduction:

        if (idx >= 32) {
            result_local[idx - 32].re += result_local[idx].re;
            result_local[idx - 32].im += result_local[idx].im;
        }

The aim here is that threads 32-63 add their results to those of threads 0-31, which they do.
However, all the other threads perform that operation too, although they should not. I.e. threads 64-95 add their results to elements 32-63, and one has a race condition again. Therefore, this step should have

        if (idx >= 32 && idx < 64) {

to be safe.

The other pieces should be safe if changed accordingly.
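
For concreteness, here is a minimal sketch of the fully guarded sequence (my own illustration, not the committed fix), assuming the complex type of the snippets above, barriers between the steps, and ls <= 128. Note that for 64 <= idx < 128, idx % 64 equals idx - 64, so the first step only changes in its guard:

        // Each step names exactly the threads that are allowed to write,
        // so no element of result_local is updated by more than one thread.
        if (idx >= 64 && idx < 128) {
            result_local[idx - 64].re += result_local[idx].re;
            result_local[idx - 64].im += result_local[idx].im;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        if (idx >= 32 && idx < 64) {
            result_local[idx - 32].re += result_local[idx].re;
            result_local[idx - 32].im += result_local[idx].im;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        // ... continue analogously with 16, 8, 4, 2 and 1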

As an alternative, I implemented another reduction a long time ago (to be found in the polyakovloop kernel).

    //reduction
    if(local_size == 1) {
        out[group_id].re += tmp_pol.re;
        out[group_id].im += tmp_pol.im;
    } else {
        //store the result of this thread in local memory
        out_loc[idx].re = tmp_pol.re;
        out_loc[idx].im = tmp_pol.im;
        //wait for all threads to have stored their result
        barrier(CLK_LOCAL_MEM_FENCE);

        //!!CP: this should be checked by someone else than me
        //perform reduction
        int cut1;
        int cut2 = local_size;
        for(cut1 = local_size / 2; cut1 > 0; cut1 /= 2) {
            for(int i = idx + cut1; i < cut2; i += cut1) {
                out_loc[idx].re += out_loc[i].re;
                out_loc[idx].im += out_loc[i].im;
            }
            //!!CP: is this dangerous inside a for-loop?
            barrier(CLK_LOCAL_MEM_FENCE);
            cut2 = cut1;
        }
        if(idx == 0) {
            out[group_id].re = out_loc[0].re;
            out[group_id].im = out_loc[0].im;
        }
    }

This should be checked by someone other than me, but I think it works (after all, the polyakovloop works, and I briefly tested this in the inverter, too).

The idea is that each thread exits the inner for-loop depending on its idx, i.e. it acts like the if with two conditions from above.
But since this implementation depends on local_size explicitly, there should be no race conditions.

For example, suppose one has 256 threads = cut2. In the first step, cut1 = 128, which means threads 128 to 255 exit the inner for-loop right away. Their results are fetched by threads 0-127 via a[idx] += a[idx + 128]. After that, threads 0-127 also exit the inner for-loop, because then i = idx + cut1 + cut1 = idx + cut2, which is always >= cut2. The next iteration of the outer loop is similar, but with cut2 = 128 and cut1 = 64. Just like in the first iteration, all threads with idx >= 64 do nothing in this round, so there are no race conditions.
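
As a sanity check, here is a small host-side C program (my own sketch, not part of the kernels) that replays this index pattern sequentially for a power-of-two size. It relies on the fact that within each pass the threads touch disjoint elements, so the order of execution does not matter:

    #include <stdio.h>

    #define LS 256  /* local size, assumed a power of 2 */

    int main(void) {
        double out_loc[LS];
        for (int idx = 0; idx < LS; idx++)
            out_loc[idx] = 1.0; /* every "thread" contributes 1 */

        int cut2 = LS;
        for (int cut1 = LS / 2; cut1 > 0; cut1 /= 2) {
            /* run every thread's inner loop; for a power-of-2 size each
             * thread idx < cut1 adds exactly out_loc[idx + cut1], and the
             * elements accessed by different threads are disjoint */
            for (int idx = 0; idx < LS; idx++)
                for (int i = idx + cut1; i < cut2; i += cut1)
                    out_loc[idx] += out_loc[i];
            /* barrier(CLK_LOCAL_MEM_FENCE) would go here in the kernel */
            cut2 = cut1;
        }
        printf("sum = %g (expected %d)\n", out_loc[0], LS);
        return 0;
    }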

The outer loop runs until thread 0 has all results. In terms of performance, one could stop at 8 or so remaining partial results and then do the summation explicitly, as is done in the other reduction.
This could also matter for the one drawback of this implementation: the for-loops may not be as performant as hard-coded steps.

However, I think we should implement the second way, since this is safe for all ls.


On_Reduction.pdf - Comments about gaugeobservables_polyakov.cl code. (56.1 kB) Alessandro Sciarra, 23 Apr 2013 03:34 pm


Related issues

related to CL2QCD - Unit Test #465: Implement reduction tests Done 23 Apr 2013
related to CL2QCD - Defect #437: Local reduction unsafe Done 08 Mar 2013

Associated revisions

Revision 875c360d
Added by Christopher Pinke over 6 years ago

added proper local reduction to sf_squarenorm
refs #462

Revision cafb7d5a
Added by Christopher Pinke over 6 years ago

added proper local reduction to sf_eo_squarenorm
refs #462

Revision 6c1b20e7
Added by Christopher Pinke over 6 years ago

added proper local reduction to spinorfield_scalar_product
refs #462

Revision 6d203ced
Added by Christopher Pinke over 6 years ago

added proper local reduction to spinorfield_eo_scalar_product
refs #462

Revision 45e393ca
Added by Christopher Pinke over 6 years ago

added proper local reduction to plaquette
refs #462

Revision 46a8e923
Added by Christopher Pinke over 6 years ago

added proper local reduction to rectangles
refs #462

Revision 0e687fae
Added by Christopher Pinke over 6 years ago

added proper local reduction to gm squarenorm
refs #462

Revision 98915ae0
Added by Christopher Pinke over 6 years ago

added proper local reduction to polyakov
refs #462

Revision 2e1bd251
Added by Alessandro Sciarra over 6 years ago

Replaced local reduction for staggered with the new version.
refs #462 #427

History

Updated by Christopher Pinke over 6 years ago

What do you think?

  • Status changed from In Progress to Feedback

Updated by Alessandro Sciarra over 6 years ago

I checked the code from gaugeobservables_polyakov.cl thoroughly and wrote down my considerations in the attached file. The main result is that this method is reliable if and only if local_size is a power of 2.
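
For illustration (my own trace, not taken from the attachment): with local_size = 6 the loop reduction above reaches the pass cut1 = 1, cut2 = 3, where the inner loops of different threads overlap:

    // local_size = 6, pass cut1 = 1, cut2 = 3 (after the cut1 = 3 pass):
    // thread 0: out_loc[0] += out_loc[1]; out_loc[0] += out_loc[2];
    // thread 1: out_loc[1] += out_loc[2];
    // Thread 0 reads out_loc[1] while thread 1 may still be adding
    // out_loc[2] into it, so the result depends on the execution order.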

Updated by Christopher Pinke over 6 years ago

Alessandro, thank you very much for these considerations, you have a very good point here. Of course, I only thought of powers of 2 when thinking about the code (this justifies line 9 in your writeup). This will be the case in most applications, but it need not be.

I came up with a solution that I think should work.

    //reduction
    if(local_size == 1) {
        out[group_id].re += tmp_pol.re;
        out[group_id].im += tmp_pol.im;
    } else {
        //store the result of this thread in local memory
        out_loc[idx].re = tmp_pol.re;
        out_loc[idx].im = tmp_pol.im;
        //wait for all threads to have stored their result
        barrier(CLK_LOCAL_MEM_FENCE);

        //!!CP: this should be checked by someone else than me
        //perform reduction
        //cut0 is local_size with all factors of 2 divided out
        int factor = 1;
        int cut0 = local_size;
        while(cut0 % 2 == 0)
            cut0 /= 2;
        //optionally stop "factor" halving steps early, so that thread 0
        //collects factor * cut0 partial results at the end
        cut0 *= factor;
        if(cut0 > local_size)
            cut0 /= factor;
        int cut1;
        int cut2 = local_size;
        //run while cut1 >= cut0, so that exactly cut0 partial results remain
        for(cut1 = local_size / 2; cut1 >= cut0; cut1 /= 2) {
            for(int i = idx + cut1; i < cut2; i += cut1) {
                out_loc[idx].re += out_loc[i].re;
                out_loc[idx].im += out_loc[i].im;
            }
            //!!CP: is this dangerous inside a for-loop?
            barrier(CLK_LOCAL_MEM_FENCE);
            cut2 = cut1;
        }
        if(idx == 0) {
            out[group_id].re = out_loc[0].re;
            out[group_id].im = out_loc[0].im;
            for(int i = 1; i < cut0; i++) {
                out[group_id].re += out_loc[i].re;
                out[group_id].im += out_loc[i].im;
            }
        }
    }

Here, I introduced a third cut, cut0. The idea is that this is ls with all factors of 2 divided out; in the example of ls = 6 it is 3, whereas for ls = 8 it is 1. The outer loop then exits once cut1 is smaller than cut0, i.e. for ls = 8 nothing changes, and for ls = 3 the race condition is avoided.
In the end, thread 0 collects all cut0 remaining results. Again, for ls = 8 this is the same as before; for ls = 3, thread 0 collects 3 results.
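
As a quick check (my own trace of the code above): for ls = 6 one gets cut0 = 3, so the outer loop runs exactly once, with cut1 = 3 and cut2 = 6:

    // local_size = 6, cut0 = 3: only the pass cut1 = 3 is executed
    // thread 0: out_loc[0] += out_loc[3];
    // thread 1: out_loc[1] += out_loc[4];
    // thread 2: out_loc[2] += out_loc[5];
    // threads 3-5: idle, since idx + 3 >= 6
    // next cut1 = 1 < cut0, so the loop exits and thread 0 sums out_loc[0..2]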

Since one could think of breaking off the for-loops at some point anyway, for example once only 8 threads (out of 128) are left, and letting these results be collected by one thread (which is exactly what thread 0 does here), I introduced the "factor", which effectively reduces the number of iterations of the outer loop. For example, with factor = 8 and ls = 128 one gets cut0 = 8, so the loop stops once 8 partial results remain.

Please check this, but I think it should work.
Of course, one has to see whether the computational overhead introduced here pays off, or whether one should simply exclude local sizes that are not powers of 2 by hand.

Updated by Matthias Bach over 6 years ago

I would personally not add any additional complication to a kernel just to make it usable with every possible local size. It is completely valid to specify that a kernel can only be used with certain local sizes, just as one specifies that the output buffer has to have as many entries as there are groups.

Updated by Christopher Pinke over 6 years ago

I replaced all local reductions (except in the staggered kernels) with a form similar to the one above.
The difference is that I sum the last 8 results in a hard-coded fashion; a sketch of the resulting pattern is given below.
I added a comment to the kernels that they should only be used with ls a power of 2 and bigger than 8.

  • % Done changed from 0 to 100
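
For reference, a minimal sketch of this final pattern (my reconstruction, not the committed code; out_loc, idx, local_size, out and group_id as in the snippets above, and tmp standing for the thread's partial result, i.e. tmp_pol in the polyakov kernel). It assumes ls is a power of 2 and bigger than 8:

    //store the result of this thread and wait for all others
    out_loc[idx].re = tmp.re;
    out_loc[idx].im = tmp.im;
    barrier(CLK_LOCAL_MEM_FENCE);
    //halve the number of partial results per pass, down to 8
    for(int cut = local_size / 2; cut >= 8; cut /= 2) {
        if(idx < cut) {
            out_loc[idx].re += out_loc[idx + cut].re;
            out_loc[idx].im += out_loc[idx + cut].im;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    //hard-coded summation of the last 8 partial results
    if(idx == 0) {
        out[group_id].re = out_loc[0].re + out_loc[1].re + out_loc[2].re + out_loc[3].re
                         + out_loc[4].re + out_loc[5].re + out_loc[6].re + out_loc[7].re;
        out[group_id].im = out_loc[0].im + out_loc[1].im + out_loc[2].im + out_loc[3].im
                         + out_loc[4].im + out_loc[5].im + out_loc[6].im + out_loc[7].im;
    }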
