The CUDA backend offloads tensor operations to NVIDIA GPUs using CUDA kernels and cuBLAS. It also supports AMD GPUs via ROCm (HIP) and Moore Threads GPUs via MUSA, using the same API.
## Requirements
- NVIDIA GPU with CUDA Compute Capability 5.0 or later
- CUDA Toolkit (nvcc, runtime libraries)
- CMake 3.17+
When built with ROCm support (GGML_HIP=ON), the backend name becomes "ROCm" and cuBLAS is replaced by hipBLAS. The C API is identical.
## Build

```sh
cmake -B build -DGGML_CUDA=ON
cmake --build build
```
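For AMD GPUs, the analogous ROCm build uses the `GGML_HIP` option mentioned above. This is a sketch; it assumes the ROCm/HIP toolchain is installed and on your path:

```sh
cmake -B build -DGGML_HIP=ON
cmake --build build
```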
Useful CMake options:
| Option | Default | Description |
|---|---|---|
| `GGML_CUDA=ON` | OFF | Enable the CUDA backend |
| `GGML_CUDA_FORCE_MMQ=ON` | OFF | Use the custom MMQ (quantized matrix multiplication) kernels instead of cuBLAS |
| `GGML_CUDA_FORCE_CUBLAS=ON` | OFF | Always use cuBLAS instead of the MMQ kernels |
| `GGML_CUDA_NO_PEER_COPY=ON` | OFF | Disable direct GPU-to-GPU peer copies |
| `GGML_CUDA_FA=ON` | ON | Compile the FlashAttention CUDA kernels |
| `CMAKE_CUDA_ARCHITECTURES` | auto | Target GPU architectures, e.g. `"89;90"` |
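These options combine on the configure line. As a sketch, a build that forces the MMQ kernels and targets specific architectures might look like this (the values `89;90` are placeholders for your actual GPUs):

```sh
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON -DCMAKE_CUDA_ARCHITECTURES="89;90"
cmake --build build
```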
## Initialization

```c
#include <stdio.h>
#include "ggml-cuda.h"

// Initialize on CUDA device 0
ggml_backend_t backend = ggml_backend_cuda_init(0);
if (!backend) {
    fprintf(stderr, "failed to initialize CUDA backend\n");
    return 1;
}
```
To select a specific device, enumerate the available devices first:

```c
int n = ggml_backend_cuda_get_device_count();
for (int i = 0; i < n; i++) {
    char desc[256];
    size_t free, total;
    ggml_backend_cuda_get_device_description(i, desc, sizeof(desc));
    ggml_backend_cuda_get_device_memory(i, &free, &total);
    printf("device %d: %s — %.1f / %.1f GB free\n",
           i, desc, free / 1e9, total / 1e9);
}

// Pass the chosen device index to ggml_backend_cuda_init
ggml_backend_t backend = ggml_backend_cuda_init(0);
```
## Buffer types
The CUDA backend provides three buffer types:
```c
// Standard device buffer (VRAM)
ggml_backend_buffer_type_t buft = ggml_backend_cuda_buffer_type(device);

// Pinned host buffer — faster CPU↔GPU transfers
ggml_backend_buffer_type_t host_buft = ggml_backend_cuda_host_buffer_type();

// Split buffer — distributes tensor rows across multiple GPUs
float tensor_split[4] = { 0.5f, 0.5f, 0.0f, 0.0f }; // 50/50 split across devices 0 and 1
ggml_backend_buffer_type_t split_buft = ggml_backend_cuda_split_buffer_type(0, tensor_split);
```
Use the pinned host buffer type for tensors on the CPU side of a CPU/GPU pipeline: pinned (page-locked) memory can be transferred to and from the GPU significantly faster than pageable memory.
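For example, CPU-side staging tensors can be allocated directly from this buffer type. This is a minimal sketch, assuming a `ggml_context` named `ctx` that was created with `no_alloc = true` and already holds the tensor definitions:

```c
// Allocate every tensor defined in `ctx` into one pinned host buffer
ggml_backend_buffer_type_t host_buft = ggml_backend_cuda_host_buffer_type();
ggml_backend_buffer_t staging = ggml_backend_alloc_ctx_tensors_from_buft(ctx, host_buft);

// ggml_backend_tensor_set/get on these tensors now uses pinned transfers
// ...

ggml_backend_buffer_free(staging);
```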
## Multi-GPU setup
To use more than one GPU, create a separate backend per device and pass them all to the scheduler:
```c
int n_devices = ggml_backend_cuda_get_device_count();

// One extra slot for the CPU fallback backend
ggml_backend_t backends[GGML_CUDA_MAX_DEVICES + 1];
for (int i = 0; i < n_devices; i++) {
    backends[i] = ggml_backend_cuda_init(i);
}

// Add CPU as a fallback
ggml_backend_t cpu = ggml_backend_cpu_init();
backends[n_devices] = cpu;

ggml_backend_sched_t sched = ggml_backend_sched_new(
    backends, NULL, n_devices + 1, GGML_DEFAULT_GRAPH_SIZE, false, true
);
```
For weight tensors spread across GPUs, allocate them into a split buffer:
```c
// Equal split across all devices
float split[GGML_CUDA_MAX_DEVICES] = {0};
for (int i = 0; i < n_devices; i++) split[i] = 1.0f / n_devices;

ggml_backend_buffer_type_t split_buft =
    ggml_backend_cuda_split_buffer_type(0, split);
ggml_backend_buffer_t weights_buf =
    ggml_backend_buft_alloc_buffer(split_buft, weights_size);
```
## Pinned host memory

To get pinned-memory transfer speeds for memory you have already allocated, register the existing host buffer with the CUDA driver:

```c
void * host_ptr = malloc(buffer_size);
ggml_backend_cuda_register_host_buffer(host_ptr, buffer_size);
// ... use host_ptr with ggml_backend_tensor_set/get ...
ggml_backend_cuda_unregister_host_buffer(host_ptr);
free(host_ptr);
```
## API summary

| Function | Description |
|---|---|
| `ggml_backend_cuda_init(device)` | Create a CUDA backend for the given device index |
| `ggml_backend_is_cuda(backend)` | Check whether a backend is a CUDA backend |
| `ggml_backend_cuda_get_device_count()` | Number of available CUDA devices |
| `ggml_backend_cuda_get_device_description(dev, buf, size)` | Human-readable device name |
| `ggml_backend_cuda_get_device_memory(dev, free, total)` | Available and total VRAM in bytes |
| `ggml_backend_cuda_buffer_type(device)` | VRAM buffer type for a device |
| `ggml_backend_cuda_host_buffer_type()` | Pinned host memory buffer type |
| `ggml_backend_cuda_split_buffer_type(main_dev, split)` | Row-split buffer across multiple GPUs |
| `ggml_backend_cuda_register_host_buffer(ptr, size)` | Pin an existing host allocation |
| `ggml_backend_cuda_unregister_host_buffer(ptr)` | Unpin a previously registered allocation |
| `ggml_backend_cuda_reg()` | Return the CUDA backend registry entry |
`GGML_CUDA_MAX_DEVICES` is 16; you cannot create backends for more than 16 CUDA devices in a single process.