Application scaling

A multithreaded program is partitioned into blocks of threads that execute independently of each other, so a GPU with more cores automatically executes the program in less time than a GPU with fewer cores. This is important because it reveals two levels of nested parallelism: data parallelism nested within data parallelism, or data parallelism nested within task parallelism. The upper level partitions a given problem into blocks of threads. Each block of threads runs on a compute unit, for example a SIMD engine in AMD APUs. Below this high-level parallelism there is a lower level, in which a group of threads runs cooperatively within a thread block. Each of these threads runs on a processing element of the compute unit.
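
The two-level decomposition can be sketched in plain Python. This is only an illustrative model, not GPU code: the names `launch`, `kernel`, and `scale_vector` are hypothetical, and the two loops run sequentially here, whereas on a GPU the blocks would be spread across compute units and the threads across processing elements.

```python
def launch(n, block_size, kernel):
    """Call `kernel` once per (block, thread) pair, mimicking a GPU grid.

    Hypothetical helper for illustration; returns the number of blocks used.
    """
    num_blocks = (n + block_size - 1) // block_size  # ceil(n / block_size)
    for block_id in range(num_blocks):        # upper level: independent blocks
        for thread_id in range(block_size):   # lower level: threads in one block
            global_id = block_id * block_size + thread_id
            if global_id < n:                 # guard the partially filled last block
                kernel(global_id, block_id, thread_id)
    return num_blocks


def scale_vector(data, factor, block_size=4):
    """Multiply every element by `factor`, one element per logical thread."""
    out = [None] * len(data)

    def kernel(global_id, block_id, thread_id):
        # Each logical thread touches exactly one element.
        out[global_id] = data[global_id] * factor

    launch(len(data), block_size, kernel)
    return out
```

Because no block depends on the result of another, the outer loop could be executed in any order or in parallel, which is the property that lets the same program run unchanged on devices with different numbers of compute units.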

Figure: Application scaling. Fewer cores, more time. Courtesy NVIDIA®
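
The scaling effect can be estimated with a rough model. The sketch below assumes that every block takes the same time, that each compute unit runs one block at a time, and that blocks are handed out round-robin; these are simplifying assumptions, since real hardware schedules blocks dynamically and may keep several blocks resident per compute unit. The function names are hypothetical.

```python
import math


def execution_waves(num_blocks, num_compute_units):
    """Rounds needed when each compute unit runs one block per round (assumed model)."""
    return math.ceil(num_blocks / num_compute_units)


def schedule(num_blocks, num_compute_units):
    """Assign block IDs round-robin; returns one list of block IDs per compute unit."""
    units = [[] for _ in range(num_compute_units)]
    for block_id in range(num_blocks):
        units[block_id % num_compute_units].append(block_id)
    return units
```

Under these assumptions, a program split into 8 blocks needs `execution_waves(8, 2)` = 4 rounds on a device with 2 compute units and `execution_waves(8, 4)` = 2 rounds on a device with 4, without any change to the program itself.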