ggml separates the description of a computation graph from its execution. A backend is a pluggable execution target — CPU cores, a CUDA device, Apple Silicon GPU, or a remote machine. You write one graph-building routine and ggml dispatches it to whatever hardware is available.

Core types

Type                          Description
ggml_backend_t                A live execution stream on a specific device
ggml_backend_buffer_t         A memory allocation owned by a backend
ggml_backend_buffer_type_t    A factory for creating buffers of a specific kind
ggml_backend_dev_t            A discoverable hardware device
ggml_backend_reg_t            A backend registration entry (groups devices of the same type)
ggml_backend_sched_t          A multi-backend scheduler

ggml_backend_t

ggml_backend_t is an opaque handle to an initialized backend instance. It holds an execution stream and is the primary object you pass to graph compute calls.
ggml_backend_t backend = ggml_backend_cuda_init(0); // device 0
// ... run graphs ...
ggml_backend_free(backend);

ggml_backend_buffer_t and ggml_backend_buffer_type_t

Buffers hold the raw memory for tensors. A buffer type (ggml_backend_buffer_type_t) is a descriptor that tells ggml where and how to allocate memory. You get one from a backend and use it to allocate buffers:
ggml_backend_buffer_type_t buft = ggml_backend_get_default_buffer_type(backend);
ggml_backend_buffer_t buf = ggml_backend_buft_alloc_buffer(buft, size_in_bytes);
// ... assign tensors into buf ...
ggml_backend_buffer_free(buf);
Buffer usage hints let the scheduler make better decisions:
// Mark a buffer as holding model weights
ggml_backend_buffer_set_usage(buf_weights, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
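Note that ggml_backend_buft_alloc_buffer takes a raw byte count, so when packing several tensors into one buffer you must account for per-tensor alignment padding. A minimal host-side sketch of that arithmetic (the 32-byte alignment below is an illustrative assumption; query the actual value with ggml_backend_buft_get_alignment, and get each tensor's size with ggml_nbytes):

```cpp
#include <cstddef>

// Round a size up to the next multiple of `align` (align must be a power of two).
static size_t align_up(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
}

// Total bytes needed for a set of tensor sizes, each padded to the alignment.
static size_t buffer_size_needed(const size_t * tensor_bytes, size_t count, size_t align) {
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += align_up(tensor_bytes[i], align);
    }
    return total;
}
```

For example, a 32-byte F32 tensor and a 24-byte F32 tensor need 64 bytes at 32-byte alignment, not 56.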

ggml_backend_dev_t and device discovery

Every registered backend exposes one or more ggml_backend_dev_t objects. You can enumerate all available devices at runtime:
ggml_backend_load_all(); // load all compiled-in backends

size_t count = ggml_backend_dev_count();
for (size_t i = 0; i < count; i++) {
    ggml_backend_dev_t dev = ggml_backend_dev_get(i);
    struct ggml_backend_dev_props props;
    ggml_backend_dev_get_props(dev, &props);
    printf("%s: %s (%.1f GB free)\n",
           props.name,
           props.description,
           props.memory_free / 1e9);
}
Device types are defined by ggml_backend_dev_type:
Enum value                        Meaning
GGML_BACKEND_DEVICE_TYPE_CPU      CPU using system memory
GGML_BACKEND_DEVICE_TYPE_GPU      Discrete GPU with dedicated memory
GGML_BACKEND_DEVICE_TYPE_IGPU     Integrated GPU using host memory
GGML_BACKEND_DEVICE_TYPE_ACCEL    Accelerator used alongside the CPU (e.g. BLAS, AMX)
Convenience initializers select a backend without enumerating devices manually:
// Best available GPU, or CPU if no GPU is found
ggml_backend_t backend = ggml_backend_init_best();

// First device of a specific type
ggml_backend_t cpu = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, NULL);

// Backend by name (e.g. "CUDA0", "Metal")
ggml_backend_t named = ggml_backend_init_by_name("CUDA0", NULL);

The backend scheduler

ggml_backend_sched_t lets you run a single computation graph across multiple backends simultaneously. The scheduler:
  • Assigns each graph node to the backend that best supports the operation
  • Copies tensors between backends automatically when needed
  • Allocates compute buffers on each backend
  • Prioritises backends with a lower index in the array you supply
Tensors allocated in buffers marked GGML_BACKEND_BUFFER_USAGE_WEIGHTS are preferentially assigned to whichever backend owns those weights.
ggml_backend_t backends[2] = { gpu_backend, cpu_backend };
ggml_backend_sched_t sched = ggml_backend_sched_new(
    backends,
    NULL,                      // use default buffer types
    2,                         // number of backends
    GGML_DEFAULT_GRAPH_SIZE,   // max nodes in graph
    false,                     // parallel splits
    true                       // op offload
);
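The priority rule can be pictured as a plain first-match loop over the backend array; this is a deliberate simplification (the real scheduler also weighs tensor locality and split costs), with the supports callback standing in for ggml_backend_supports_op:

```cpp
#include <functional>

// Pick the backend for a graph node: the first (lowest-index) backend in the
// array that supports the node's op wins. Returns -1 if none support it.
// `supports` stands in for ggml_backend_supports_op in this sketch.
static int pick_backend(int n_backends,
                        const std::function<bool(int backend, int op)> & supports,
                        int op) {
    for (int i = 0; i < n_backends; i++) {
        if (supports(i, op)) {
            return i;
        }
    }
    return -1;
}
```

With { gpu_backend, cpu_backend }, this is why ops land on the GPU whenever it supports them and fall back to the CPU otherwise.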
The scheduler API follows a straightforward lifecycle:
1. Reserve (optional). Pass a representative max-size graph to pre-allocate buffers. This avoids allocation at compute time.

struct ggml_cgraph * measure_graph = build_graph(sched, MAX_BATCH);
ggml_backend_sched_reserve(sched, measure_graph);

2. Reset. Clear allocations from the previous graph before computing a new one.

ggml_backend_sched_reset(sched);

3. Allocate. Explicitly allocate the graph; if you skip this step, allocation happens automatically on the first compute.

ggml_backend_sched_alloc_graph(sched, graph);

4. Set inputs. Copy data into the allocated input tensors.

ggml_backend_tensor_set(input_tensor, host_data, 0, nbytes);

5. Compute. Execute the graph. Returns a ggml_status value.

ggml_backend_sched_graph_compute(sched, graph);

6. Read outputs. Copy results back to host memory.

ggml_backend_tensor_get(result, out_data, 0, nbytes);

Complete example

The following is drawn directly from examples/simple/simple-backend.cpp and shows the full lifecycle — backend selection, graph construction, scheduling, and result retrieval.
#include "ggml.h"
#include "ggml-backend.h"

#include <cstdint>
#include <vector>

struct simple_model {
    struct ggml_tensor * a {};
    struct ggml_tensor * b {};
    ggml_backend_t backend {};
    ggml_backend_t cpu_backend {};
    ggml_backend_sched_t sched {};
    std::vector<uint8_t> buf;
};

void init_model(simple_model & model) {
    ggml_backend_load_all();

    // Pick the best available GPU, fall back to CPU
    model.backend = ggml_backend_init_best();
    model.cpu_backend = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, nullptr);

    ggml_backend_t backends[2] = { model.backend, model.cpu_backend };
    model.sched = ggml_backend_sched_new(backends, nullptr, 2,
                                         GGML_DEFAULT_GRAPH_SIZE, false, true);
}

struct ggml_cgraph * build_graph(simple_model & model) {
    size_t buf_size = ggml_tensor_overhead() * GGML_DEFAULT_GRAPH_SIZE
                    + ggml_graph_overhead();
    model.buf.resize(buf_size);

    struct ggml_init_params params = {
        .mem_size   = buf_size,
        .mem_buffer = model.buf.data(),
        .no_alloc   = true,
    };
    struct ggml_context * ctx = ggml_init(params);
    struct ggml_cgraph  * gf  = ggml_new_graph(ctx);

    model.a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 4);
    model.b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 3);

    struct ggml_tensor * result = ggml_mul_mat(ctx, model.a, model.b);
    ggml_build_forward_expand(gf, result);
    ggml_free(ctx);
    return gf;
}

// Input data for the 2x4 and 2x3 tensors created in build_graph
// (illustrative values; any floats will do for this demo)
static const float matrix_A[4 * 2] = { 2, 8, 5, 1, 4, 2, 8, 6 };
static const float matrix_B[3 * 2] = { 10, 5, 9, 9, 5, 4 };

struct ggml_tensor * compute(simple_model & model, struct ggml_cgraph * gf) {
    ggml_backend_sched_reset(model.sched);
    ggml_backend_sched_alloc_graph(model.sched, gf);

    ggml_backend_tensor_set(model.a, matrix_A, 0, ggml_nbytes(model.a));
    ggml_backend_tensor_set(model.b, matrix_B, 0, ggml_nbytes(model.b));

    ggml_backend_sched_graph_compute(model.sched, gf);
    return ggml_graph_node(gf, -1);
}

int main(void) {
    simple_model model;
    init_model(model);

    struct ggml_cgraph * gf = build_graph(model);
    struct ggml_tensor * result = compute(model, gf);

    std::vector<float> out(ggml_nelements(result));
    ggml_backend_tensor_get(result, out.data(), 0, ggml_nbytes(result));

    ggml_backend_sched_free(model.sched);
    ggml_backend_free(model.backend);
    ggml_backend_free(model.cpu_backend);
}
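For checking the example's output shape and values on the host: ggml_mul_mat(ctx, a, b) produces a tensor of shape [a->ne[1], b->ne[1]] by dotting rows of a with rows of b, which must share the inner dimension ne[0]. A reference sketch of that semantic (so the 2x4 a and 2x3 b above yield a 4x3 result):

```cpp
#include <vector>

// Reference for ggml_mul_mat semantics: `a` holds m rows of length k,
// `b` holds n rows of length k, both row-major; the m x n result is
// out[j * m + i] = dot(row i of a, row j of b).
static std::vector<float> mul_mat_ref(const std::vector<float> & a,
                                      const std::vector<float> & b,
                                      int k, int m, int n) {
    std::vector<float> out(m * n, 0.0f);
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            for (int c = 0; c < k; c++) {
                out[j * m + i] += a[i * k + c] * b[j * k + c];
            }
        }
    }
    return out;
}
```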

Available backends

Backend    Platforms                  Hardware                     Build flag
CPU        All                        x86, ARM, RISC-V, PowerPC    Always available
CUDA       Linux, Windows             NVIDIA GPUs                  -DGGML_CUDA=ON
Metal      macOS 13+                  Apple Silicon, AMD GPUs      -DGGML_METAL=ON
Vulkan     Linux, Windows, Android    Cross-vendor GPUs            -DGGML_VULKAN=ON
OpenCL     Linux, Windows, Android    AMD, Intel, Qualcomm         -DGGML_OPENCL=ON
SYCL       Linux                      Intel GPUs, oneAPI           -DGGML_SYCL=ON
RPC        All                        Remote devices               -DGGML_RPC=ON

CPU backend

SIMD-optimised execution on x86 and ARM with configurable thread pools.

CUDA backend

NVIDIA GPU acceleration with multi-GPU and split-tensor support.

Metal backend

Native Apple GPU compute for macOS and Apple Silicon.

Vulkan backend

Cross-vendor GPU support for Linux, Windows, and Android.

RPC backend

Distribute computation to remote machines over the network.