Thread Hierarchy in CUDA

The primary advantage of CUDA programming is the ability to run thousands of threads in parallel. To manage this parallelism, CUDA organizes work into a structured hierarchy of threads, blocks and grids.

Hierarchical Structure

CUDA parallelizes tasks by breaking them down into a three-level hierarchy. This structure allows the GPU to scale across different hardware generations seamlessly. If a new GPU has more processing cores, it simply executes more blocks at the same time without requiring you to rewrite your code.

Thread: The smallest unit of execution. Each thread is an individual worker executing a copy of your kernel.
Block: A group of threads that collaborate. Threads within the same block run on the same streaming multiprocessor (SM) of the GPU, so they can exchange data through fast on-chip shared memory and synchronize with one another (for example, with __syncthreads()).
Grid: A collection of thread blocks. A grid represents the entire workspace for a single kernel launch. Blocks within a grid must be able to run independently and in any order, which is what lets the GPU distribute them across however many multiprocessors it has.

Grid and Block Dimensions

In many real-world GPU applications, data is not arranged as a simple one-dimensional array. Instead, it often represents structured data such as images (2D), volumetric data (3D), or matrices. To efficiently process such data, CUDA provides built-in support for multidimensional execution configuration.

To map threads naturally to your data's shape, CUDA allows both Grids and Blocks to be structured in up to three dimensions: x-dimension, y-dimension and z-dimension.
This is done with dim3, a built-in CUDA type with three unsigned integer components: .x, .y and .z. Any component you do not specify defaults to 1, so dim3(16, 16) describes a 16 × 16 × 1 shape.
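
The defaulting rule can be checked without a GPU. The sketch below is host-only C++ and uses a hypothetical Dim3 struct (an illustrative stand-in, not the real CUDA type) that mimics how dim3 pads missing components with 1. The volume helper is also an assumption made for this example.

```cpp
// Host-only sketch: Dim3 is a hypothetical stand-in for CUDA's dim3,
// written only to illustrate that unspecified components default to 1.
struct Dim3 {
    unsigned int x, y, z;
    constexpr Dim3(unsigned int x_ = 1, unsigned int y_ = 1, unsigned int z_ = 1)
        : x(x_), y(y_), z(z_) {}
};

// Total number of elements (threads or blocks) described by a shape.
constexpr unsigned long long volume(Dim3 d) {
    return static_cast<unsigned long long>(d.x) * d.y * d.z;
}

// A 1D, a 2D and a 3D shape; missing components become 1.
constexpr Dim3 line(256);       // 256 x 1 x 1
constexpr Dim3 tile(16, 16);    // 16 x 16 x 1
constexpr Dim3 brick(8, 8, 4);  // 8 x 8 x 4

// Compile-time checks of the values claimed in the comments above.
static_assert(line.y == 1 && line.z == 1, "1D shape pads y and z with 1");
static_assert(tile.z == 1, "2D shape pads z with 1");
static_assert(volume(brick) == 256, "8 * 8 * 4 elements");
```

Because every component is at least 1, the product x × y × z always gives the number of threads in a block or blocks in a grid, whatever the dimensionality.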

Defining Multidimensional Execution

When launching a kernel, you must provide two execution configuration parameters to the GPU:

Grid dimensions: How many blocks are in the grid?
Block dimensions: How many threads are in a single block?

Each of these can be defined in 1D, 2D, or 3D depending on the problem:

C++

// A 2D grid of 16x16 blocks
dim3 blocksPerGrid(16, 16);

// Each block contains 32x32 threads
dim3 threadsPerBlock(32, 32);

// Launch the kernel using this structure
myKernel<<<blocksPerGrid, threadsPerBlock>>>();

In this configuration:

The grid has 16 × 16 = 256 blocks
Each block has 32 × 32 = 1024 threads
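Total threads launched = 256 × 1024 = 262,144. These threads are not all guaranteed to run at the same instant; the GPU schedules blocks onto its multiprocessors as resources become available.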

This structured layout allows CUDA to map threads naturally to corresponding elements in multidimensional datasets.
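
In practice the data size is rarely an exact multiple of the block size, so the grid is usually rounded up to cover every element. The host-only sketch below reproduces the arithmetic from the configuration above and shows one common way to size a grid. The helper names (ceilDiv, totalThreads) and the 1920 × 1080 image size are assumptions chosen for illustration, not part of the CUDA API.

```cpp
// Host-only sketch of execution-configuration arithmetic.
// ceilDiv, totalThreads and the image size are illustrative, not CUDA API.

// Smallest number of blocks of `block` threads needed to cover `n` elements.
constexpr unsigned int ceilDiv(unsigned int n, unsigned int block) {
    return (n + block - 1) / block;
}

// Threads launched by a 2D grid of gridX x gridY blocks,
// each holding blockX x blockY threads.
constexpr unsigned long long totalThreads(unsigned int gridX, unsigned int gridY,
                                          unsigned int blockX, unsigned int blockY) {
    return static_cast<unsigned long long>(gridX) * gridY * blockX * blockY;
}

// The configuration from the text: 16x16 blocks of 32x32 threads.
static_assert(totalThreads(16, 16, 32, 32) == 262144ULL,
              "256 blocks * 1024 threads");

// Covering a 1920x1080 image with 16x16-thread blocks:
// 1920 / 16 = 120 blocks across; 1080 / 16 = 67.5, rounded up to 68 down.
static_assert(ceilDiv(1920, 16) == 120, "width divides evenly");
static_assert(ceilDiv(1080, 16) == 68, "height needs one partial block");
```

In CUDA code these values would typically feed a declaration such as dim3 blocksPerGrid(ceilDiv(width, 16), ceilDiv(height, 16)). Since 68 × 16 = 1088 exceeds 1080, a kernel launched this way needs a bounds check so the surplus threads do nothing.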

Understanding Hardware Limits

While CUDA allows for flexible thread organization, there are physical hardware limits to how many threads can exist within a single block:

The 1024 Thread Limit: On current NVIDIA hardware, a single thread block cannot contain more than 1,024 threads in total. The product of the block dimensions must satisfy x × y × z ≤ 1024.
Dimension Caps: Even if the total is under 1,024, there are caps on specific dimensions. Usually, the 'x' and 'y' dimensions of a block can go up to 1,024, but the 'z' dimension is strictly limited to 64.

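Note: If you attempt to launch a block with dim3 threadsPerBlock(32, 64), the block would contain 32 × 64 = 2,048 threads. The launch fails with an invalid-configuration error (cudaErrorInvalidConfiguration) and the kernel never runs. The failure is silent unless you check for it, for example by calling cudaGetLastError() after the launch.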
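
A block shape can be validated on the host before launching. The sketch below hard-codes the limits quoted above as an assumption; the constant names and the isValidBlock helper are hypothetical. Real code would more likely read the limits from the device, for instance from the maxThreadsPerBlock and maxThreadsDim fields that cudaGetDeviceProperties fills in.

```cpp
// Host-only sketch: checks a block shape against the limits quoted in the text.
// The constants are assumed values; query the device for the real ones.
constexpr unsigned int kMaxThreadsPerBlock = 1024;
constexpr unsigned int kMaxBlockDimX = 1024;
constexpr unsigned int kMaxBlockDimY = 1024;
constexpr unsigned int kMaxBlockDimZ = 64;

// True if every dimension is within its cap and the product fits in a block.
constexpr bool isValidBlock(unsigned int x, unsigned int y = 1, unsigned int z = 1) {
    return x >= 1 && y >= 1 && z >= 1 &&
           x <= kMaxBlockDimX && y <= kMaxBlockDimY && z <= kMaxBlockDimZ &&
           static_cast<unsigned long long>(x) * y * z <= kMaxThreadsPerBlock;
}

static_assert(isValidBlock(32, 32), "1024 threads: exactly at the limit");
static_assert(!isValidBlock(32, 64), "2048 threads: too many in total");
static_assert(!isValidBlock(1, 1, 128), "only 128 threads, but z exceeds 64");
static_assert(isValidBlock(16, 1, 64), "1024 threads with z at its cap");
```

The third case shows that the per-dimension caps apply independently of the total: a block of 128 threads is rejected if all of them are placed along z.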

Accessing Multidimensional Indices

Each thread can uniquely identify its position using built-in variables:

threadIdx.(x, y, z): position within a block
blockIdx.(x, y, z): position of the block within the grid
blockDim.(x, y, z): dimensions of a block
gridDim.(x, y, z): dimensions of the grid

These variables allow threads to determine which portion of the data they should operate on, based on their position in the hierarchy. For example, within a single 2D block:

int row = threadIdx.y;
int col = threadIdx.x;

Here:

row represents the thread's vertical position within its block
col represents the thread's horizontal position within its block

Note: In larger problems, block and thread indices are often combined to compute a global index, which uniquely identifies each thread across the entire grid.
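
The global index mentioned in the note follows a simple pattern along each axis: count the threads in all preceding blocks, then add the local offset. The sketch below evaluates that formula on the host. The globalIndex function and its parameter names are hypothetical; in a kernel the same arithmetic is written with the built-in variables, as shown in the comments.

```cpp
// Host-only sketch of the global-index formula. In a real kernel the same
// arithmetic uses CUDA's built-in variables, for example:
//   int col = blockIdx.x * blockDim.x + threadIdx.x;
//   int row = blockIdx.y * blockDim.y + threadIdx.y;
// The function and parameter names below are illustrative, not CUDA API.

// Global position along one axis: skip the threads of all earlier blocks,
// then add the thread's offset inside its own block.
constexpr unsigned int globalIndex(unsigned int blockIndex,
                                   unsigned int blockSize,
                                   unsigned int threadIndex) {
    return blockIndex * blockSize + threadIndex;
}

// With 16x16-thread blocks, local thread (x = 8, y = 8) in block (x = 1, y = 0):
static_assert(globalIndex(1, 16, 8) == 24, "global column = 1 * 16 + 8");
static_assert(globalIndex(0, 16, 8) == 8, "global row = 0 * 16 + 8");

// The first thread of block 1 directly follows the last thread of block 0.
static_assert(globalIndex(1, 16, 0) == globalIndex(0, 16, 15) + 1,
              "no gaps between neighbouring blocks");
```

Applying the formula separately to x and y gives each thread a unique (row, col) pair across the whole grid, which is what lets one thread own one pixel or one matrix element.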

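Example: The following program shows how threads in a 2D grid identify their row and column positions within their respective blocks. The leading %%cuda line is a notebook cell magic for running CUDA code in Jupyter or Colab; omit it when compiling with nvcc directly.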

C++

%%cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void checkIndexKernel() {

    // Target only local thread (8,8) to keep the console clean.
    if (threadIdx.x == 8 && threadIdx.y == 8) {
        printf("Execution Status: Thread [%d, %d] active in Block (%d, %d)\n",
               threadIdx.x, threadIdx.y, blockIdx.x, blockIdx.y);
    }
}

int main() {
    
    // Set up a 2x2 Grid, where each block holds 16x16 threads
    dim3 blocksInGrid(2, 2);
    dim3 threadsInBlock(16, 16);

    // Launch the kernel with the specified structure
    checkIndexKernel<<<blocksInGrid, threadsInBlock>>>();

    // Wait for the kernel to finish; this also flushes device-side printf output
    cudaDeviceSynchronize();
    return 0;
}

Output

Execution Status: Thread [8, 8] active in Block (0, 1)
Execution Status: Thread [8, 8] active in Block (0, 0)
Execution Status: Thread [8, 8] active in Block (1, 0)
Execution Status: Thread [8, 8] active in Block (1, 1)

Explanation:

threadIdx.x and threadIdx.y represent the thread's local "address" inside the current block.
blockIdx.x and blockIdx.y identify which "group" (block) the thread belongs to within the larger grid.
Although only the thread at local position (8,8) is told to print, four lines appear. The grid contains four distinct blocks (a 2x2 layout), and every block has its own local thread at (8,8). Blocks execute in no guaranteed order, so the lines may appear in a different order from one run to the next.
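
The count of four can be reproduced on the host by walking the same index space sequentially. This is only a sketch: countPrintingThreads is a hypothetical helper, the loops replace what the GPU does in parallel, and the constexpr loops assume C++14 or later.

```cpp
// Host-only sketch (C++14 or later): visits every (block, thread) pair of a
// 2D launch and counts how many would pass the kernel's `if` test.
// countPrintingThreads is an illustrative name, not CUDA API.
constexpr int countPrintingThreads(int gridX, int gridY,
                                   int blockX, int blockY,
                                   int targetX, int targetY) {
    int count = 0;
    for (int by = 0; by < gridY; ++by) {              // blockIdx.y
        for (int bx = 0; bx < gridX; ++bx) {          // blockIdx.x
            for (int ty = 0; ty < blockY; ++ty) {     // threadIdx.y
                for (int tx = 0; tx < blockX; ++tx) { // threadIdx.x
                    if (tx == targetX && ty == targetY) {
                        ++count;
                    }
                }
            }
        }
    }
    return count;
}

// 2x2 grid of 16x16 blocks, as in the example: one match per block.
static_assert(countPrintingThreads(2, 2, 16, 16, 8, 8) == 4,
              "each of the four blocks has its own thread (8,8)");

// Local indices run from 0 to 15, so (16,16) never matches.
static_assert(countPrintingThreads(2, 2, 16, 16, 16, 16) == 0,
              "index 16 is outside a 16-wide block");
```

The condition tests only threadIdx, so the number of matches equals the number of blocks regardless of how the grid is shaped.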

Why Multidimensional Structure Matters

Multidimensional indexing provides several advantages:

Aligns naturally with structured data (images, matrices, tensors)
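Improves code readability by reducing manual index arithmetic
Supports efficient memory access: when threadIdx.x maps to the contiguous dimension of the data, neighbouring threads touch neighbouring addresses, which is critical for GPU performance
Extends from 1D to 2D and 3D problems without changing the programming model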
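
The memory-access point can be made concrete with a little arithmetic. The sketch below assumes the data is stored in row-major order (a common layout for images and C-style matrices, but an assumption here) and uses a hypothetical linearOffset helper to show which neighbours end up adjacent in memory.

```cpp
// Host-only sketch, assuming row-major storage: element (row, col) of a
// `width`-wide 2D array lives at offset row * width + col.
// linearOffset is an illustrative name, not CUDA API.
constexpr unsigned int linearOffset(unsigned int row, unsigned int col,
                                    unsigned int width) {
    return row * width + col;
}

// Stepping in x (col) moves to the very next element in memory...
static_assert(linearOffset(2, 5, 640) + 1 == linearOffset(2, 6, 640),
              "neighbours in x are adjacent in memory");

// ...while stepping in y (row) jumps a full row ahead.
static_assert(linearOffset(3, 5, 640) - linearOffset(2, 5, 640) == 640,
              "neighbours in y are one row apart");
```

Under this layout, deriving col from threadIdx.x and row from threadIdx.y means threads with consecutive x indices read consecutive addresses, which is the access pattern GPUs handle most efficiently.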