While setting the number of work groups to be equal to CL_DEVICE_MAX_COMPUTE_UNITS
might be sound advice on some hardware, it certainly is not on NVIDIA GPUs.
On the CUDA architecture, an OpenCL compute unit is the equivalent of a multiprocessor (which can have either 8, 32 or 48 cores), and these are designed to be able to simultanesouly run up to 8 work groups (blocks in CUDA) each. At larger input data sizes, you might choose to run thousands of work groups, and your particular GPU can handle up to 65535 x 65535 work groups per kernel launch.
OpenCL has another device attribute CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
. If you query that on an NVIDIA device, it will return 32 (this is the "warp", or natural SIMD width of the hardware). That value is the work group size multiple you should use; work group sizes can be up to 512 items each, depending on the resources consumed by each work item. The standard rule of thumb for your particular GPU is that you require at least 192 active work items per compute unit (threads per multiprocessor in CUDA terms) to cover all the latency the architecture and potentially obtain either full memory bandwidth or full arithmetic throughput, depending on the nature of your code.
NVIDIA ship a good document called "OpenCL Programming Guide for the CUDA Architecture" in the CUDA toolkit. You should take some time to read it, because it contains all the specifics of how the NVIDIA OpenCL implementation maps onto the features of their hardware, and it will answer the questions you have raised here.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…