[...] it seems that you are using the AMD implementation of OpenCL. I have worked with both the AMD and Nvidia implementations extensively, and it would be safe to say that Nvidia's implementation is much faster and much more complete. The biggest flaw in the AMD implementation, I would say, is the lack of support for images in OpenCL. This is a driver issue, and they plan on supporting images eventually, but despite all the time that has passed since the OpenCL standard was released, they still haven't done so! My code uses images, so it will only run on an Nvidia implementation (for now).
Also, as a general remark, I would like to tell you that from experience (and a lot of reading), not all algorithms are faster on the GPU, even those that can be parallelized. Whether or not you get a speedup depends on many factors. For example, off the top of my head, performance on the GPU depends on:
1- The speed of the GPU (seems obvious, but): Most integrated GPUs and those in standard laptops (not high-end ones) are slower than the CPUs they are paired with. So running an algorithm on those GPUs will prove much slower than running it on the available CPU.
2- Type of algorithm: If the algorithm requires a lot of data transfer between the CPU and the GPU, this will likely be a huge bottleneck.
3- The GPU manufacturer: For now, Nvidia's implementation is much better than AMD's or Intel's, and this is natural since they got into the GPU computing game much earlier than the rest, and they kind of drew the path for all the rest.
4- Power constraints: If you are working on a mobile robot and computation is done on-board the robot (as opposed to wirelessly on a desktop PC), having a fast-enough GPU on-board is probably not feasible, since such GPUs consume a lot of power and it would be hard to procure a battery powerful enough to handle one.
5- In practice (at least with today's technology), the best time to use GPU computation is when you have a desktop PC with a high-end Nvidia GPU (the kind that requires a larger system power supply) and an algorithm that can be easily parallelized.
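
To make point 2 a bit more concrete, here is a rough back-of-the-envelope sketch of how transfer time can eat up the GPU's advantage. The function name `gpu_offload_speedup` and all the numbers (timings, data sizes, a 5 GB/s bus) are made up for illustration and are not measurements from any real hardware; the model also assumes that transfers and computation do not overlap.

```python
def gpu_offload_speedup(cpu_seconds, gpu_kernel_seconds,
                        bytes_transferred, bus_bytes_per_second):
    """Estimate the speedup from offloading work to the GPU.

    Simplified model (an assumption): total GPU time is the kernel time
    plus the time to copy data between host and device, with no overlap.
    A result below 1.0 means the GPU version is slower than the CPU one.
    """
    transfer_seconds = bytes_transferred / bus_bytes_per_second
    return cpu_seconds / (gpu_kernel_seconds + transfer_seconds)


# Hypothetical case A: heavy computation on a small amount of data.
# 1 s on the CPU, 0.05 s on the GPU, 100 MB copied over a 5 GB/s bus.
compute_bound = gpu_offload_speedup(1.0, 0.05, 100e6, 5e9)

# Hypothetical case B: light computation on a large amount of data.
# 0.1 s on the CPU, 0.005 s on the GPU, 1 GB copied over the same bus.
transfer_bound = gpu_offload_speedup(0.1, 0.005, 1e9, 5e9)

print(f"case A speedup: {compute_bound:.2f}x")
print(f"case B speedup: {transfer_bound:.2f}x")
```

Under these assumed numbers, case A still comes out well ahead on the GPU, while case B ends up slower than the CPU even though its kernel is 20 times faster, simply because the copy dominates. Real figures depend on your hardware, so measure before deciding.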