Efficient CUDA development requires managing data movement between the CPU (Host) and GPU (Device) with precision. Success in parallel programming depends on understanding how parameters reach the hardware and using proper error-checking to keep applications stable.
Passing Parameters to Kernels
In CUDA, kernel parameters are passed via the execution configuration syntax <<<...>>>. While the syntax resembles a standard C++ function call, the hardware handles memory differently depending on whether you are passing values or memory addresses.
Technical Mechanics:
- Pass by Value: When passing primitive types (like int or float), the values are copied directly into the kernel's constant memory. These are read-only within the kernel and do not require manual memory management.
- Pass by Pointer: To process large datasets, you must pass pointers to Device Memory. These pointers must be allocated using cudaMalloc and populated via cudaMemcpy before the kernel can access them.
Example: This example demonstrates passing an integer by value and a pointer to an allocated memory space on the GPU.
%%cuda
#include <iostream>
#include <cuda_runtime.h>
__global__ void parameterKernel(int *d_out, int val) {
// val is passed by value (copied to constant memory)
// d_out is a pointer to global memory
*d_out = val * 10;
}
int main() {
int *d_data;
int h_result = 0;
// Allocate memory on the GPU
cudaMalloc((void**)&d_data, sizeof(int));
// Launch kernel with 1 block and 1 thread
// Passing d_data (pointer) and 5 (value)
parameterKernel<<<1, 1>>>(d_data, 5);
// Copy result back to Host to verify
cudaMemcpy(&h_result, d_data, sizeof(int), cudaMemcpyDeviceToHost);
std::cout << "Result: " << h_result << std::endl;
cudaFree(d_data);
return 0;
}
Output
Result: 50
Explanation:
- __global__ void parameterKernel defines the function to run on the GPU.
- cudaMalloc() reserves space in the GPU's Global Memory for the integer result.
- parameterKernel<<<1, 1>>> initiates the kernel; the value 5 is copied to the GPU while the address of d_data is used to store the computation.
- cudaMemcpy() transfers the computed result from the Device back to the Host for printing.
Error Handling
CUDA operations are largely asynchronous, meaning CPU continues execution before the GPU finishes its task. This can mask crashes or memory failures. To resolve this, developers use cudaGetErrorString to translate numeric error codes into human-readable descriptions.
Asynchronous Problem
Because the Host (CPU) does not wait for the Device (GPU) by default, a kernel could fail due to an "Illegal Memory Access," but the CPU might report a "Success" for subsequent commands. Effective error handling requires checking both the launch status and the execution completion.
Example: The following code demonstrates how to catch an error specifically, an invalid configuration where we try to launch a kernel with zero threads.
%%cuda
#include <iostream>
#include <cuda_runtime.h>
int main() {
// Intentional error: launching with 0 threads per block
cudaError_t err = cudaLaunchKernel((void*)0, dim3(0), dim3(0), NULL, 0, 0);
// Capture the last error occurred
err = cudaGetLastError();
if (err != cudaSuccess) {
// Use cudaGetErrorString to convert the error code to a string
std::cout << "CUDA Error encountered: " << cudaGetErrorString(err) << std::endl;
}
return 0;
}
Output
CUDA Error encountered: invalid device function
Explanation:
- cudaError_t specialized type used to store return codes from CUDA API calls.
- cudaGetLastError() function that retrieves the status of the most recent CUDA operation.
- cudaGetErrorString(err) converts the technical error enum (e.g., cudaErrorInvalidConfiguration) into a readable message like "invalid configuration argument."
- if (err != cudaSuccess) conditional check used to determine if the hardware or driver encountered an issue.