Proper implementation of reductions (Feature #102)
I implemented a reduction wherever one was needed. Each kernel first collects its data on the local level. There are currently two versions of this step: one uses a loop, the other is explicitly coded. The explicit version assumes that local_work_size is not bigger than 128 and has to be adjusted if that is not the case. The loop is perhaps the safer solution.
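The difference between the two local versions can be sketched on the host side. This is only a sequential Python emulation under assumptions, not the actual kernel code: the function names are hypothetical, a sum is assumed as the reduction operator, and the loop version assumes a power-of-two number of work-items.

```python
def reduce_local_loop(local_data):
    """Loop version: tree reduction with a halving stride.

    Assumes len(local_data) is a power of two (hypothetical restriction
    of this sketch).
    """
    data = list(local_data)
    stride = len(data) // 2
    while stride > 0:
        # On a device, every work-item with lid < stride would do this
        # in parallel, followed by a barrier before the next stride.
        for lid in range(stride):
            data[lid] += data[lid + stride]
        stride //= 2
    return data[0]


def reduce_local_unrolled_128(local_data):
    """Explicitly coded version: the steps are fixed for at most 128 items."""
    n = len(local_data)
    if n > 128:
        raise ValueError("explicit version assumes local_work_size <= 128")
    # Pad with the neutral element so every hard-coded stride is valid.
    data = list(local_data) + [0] * (128 - n)
    for lid in range(64):
        data[lid] += data[lid + 64]
    for lid in range(32):
        data[lid] += data[lid + 32]
    for lid in range(16):
        data[lid] += data[lid + 16]
    for lid in range(8):
        data[lid] += data[lid + 8]
    for lid in range(4):
        data[lid] += data[lid + 4]
    for lid in range(2):
        data[lid] += data[lid + 2]
    data[0] += data[1]
    return data[0]
```

The loop version adapts to the work-group size on its own, while the explicit version needs an extra hard-coded step for every doubling of the size, which is why the loop is the lower-risk choice.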
After that, a kernel is called in which thread 0 collects the local_data on the global level.
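The two-stage structure can be illustrated with a small host-side emulation. This is a sketch under assumptions, not the real kernels: the function names and the global_work_size parameter are hypothetical, a sum is assumed as the reduction operator, and Python's built-in sum stands in for the local reduction.

```python
def collect_on_global_level(local_results, global_work_size=4):
    """Emulates the final kernel: only work-item 0 does any work."""
    result = [0]
    for gid in range(global_work_size):
        if gid == 0:
            total = 0
            for value in local_results:
                total += value
            result[0] = total
        # All other work-items are idle in this stage.
    return result[0]


def two_stage_sum(values, local_work_size=128):
    """Stage 1: one partial result per work-group. Stage 2: thread 0 collects."""
    local_results = [
        sum(values[start:start + local_work_size])  # stand-in for the local reduction
        for start in range(0, len(values), local_work_size)
    ]
    return collect_on_global_level(local_results)
```

Because the second stage only touches one value per work-group, letting a single thread do it serially stays cheap as long as the number of work-groups is small.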