ggml separates the description of a computation graph from its execution. A backend is a pluggable execution target — CPU cores, a CUDA device, Apple Silicon GPU, or a remote machine. You write one graph-building routine and ggml dispatches it to whatever hardware is available.

Core types

Type                          Description
ggml_backend_t                A live execution stream on a specific device
ggml_backend_buffer_t         A memory allocation owned by a backend
ggml_backend_buffer_type_t    A factory for creating buffers of a specific kind
ggml_backend_dev_t            A discoverable hardware device
ggml_backend_reg_t            A backend registration entry (groups devices of the same type)
ggml_backend_sched_t          A multi-backend scheduler

ggml_backend_t

ggml_backend_t is an opaque handle to an initialized backend instance. It holds an execution stream and is the primary object you pass to graph compute calls.
ggml_backend_t backend = ggml_backend_cuda_init(0); // device 0
// ... run graphs ...
ggml_backend_free(backend);

ggml_backend_buffer_t and ggml_backend_buffer_type_t

Buffers hold the raw memory for tensors. A buffer type (ggml_backend_buffer_type_t) is a descriptor that tells ggml where and how to allocate memory. You get one from a backend and use it to allocate buffers:
ggml_backend_buffer_type_t buft = ggml_backend_get_default_buffer_type(backend);
ggml_backend_buffer_t buf = ggml_backend_buft_alloc_buffer(buft, size_in_bytes);
// ... assign tensors into buf ...
ggml_backend_buffer_free(buf);
Buffer usage hints let the scheduler make better decisions:
// Mark a buffer as holding model weights
ggml_backend_buffer_set_usage(buf_weights, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
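Note that ggml_backend_buft_alloc_buffer takes a raw byte count, so when packing several tensors into one buffer you must account for per-tensor alignment padding. A minimal host-side sketch of that arithmetic (the 32-byte alignment below is an illustrative assumption; query the actual value with ggml_backend_buft_get_alignment, and get each tensor's size with ggml_nbytes):

```cpp
#include <cstddef>

// Round a size up to the next multiple of `align` (align must be a power of two).
static size_t align_up(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
}

// Total bytes needed for a set of tensor sizes, each padded to the alignment.
static size_t buffer_size_needed(const size_t * tensor_bytes, size_t count, size_t align) {
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += align_up(tensor_bytes[i], align);
    }
    return total;
}
```

For example, a 32-byte F32 tensor and a 24-byte F32 tensor need 64 bytes at 32-byte alignment, not 56.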

ggml_backend_dev_t and device discovery

Every registered backend exposes one or more ggml_backend_dev_t objects. You can enumerate all available devices at runtime:
ggml_backend_load_all(); // load all compiled-in backends

size_t count = ggml_backend_dev_count();
for (size_t i = 0; i < count; i++) {
    ggml_backend_dev_t dev = ggml_backend_dev_get(i);
    struct ggml_backend_dev_props props;
    ggml_backend_dev_get_props(dev, &props);
    printf("%s: %s (%.1f GB free)\n",
           props.name,
           props.description,
           props.memory_free / 1e9);
}
Device types are defined by ggml_backend_dev_type:
Enum value                        Meaning
GGML_BACKEND_DEVICE_TYPE_CPU      CPU using system memory
GGML_BACKEND_DEVICE_TYPE_GPU      Discrete GPU with dedicated memory
GGML_BACKEND_DEVICE_TYPE_IGPU     Integrated GPU using host memory
GGML_BACKEND_DEVICE_TYPE_ACCEL    Accelerator used alongside the CPU (e.g. BLAS, AMX)
Convenience initializers select a backend without enumerating devices manually:
// Best available GPU, or CPU if no GPU is found
ggml_backend_t backend = ggml_backend_init_best();

// First device of a specific type
ggml_backend_t cpu = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, NULL);

// Backend by name (e.g. "CUDA0", "Metal")
ggml_backend_t named = ggml_backend_init_by_name("CUDA0", NULL);

The backend scheduler

ggml_backend_sched_t lets you run a single computation graph across multiple backends simultaneously. The scheduler:
  • Assigns each graph node to the backend that best supports the operation
  • Copies tensors between backends automatically when needed
  • Allocates compute buffers on each backend
  • Prioritises backends with a lower index in the array you supply
Tensors allocated in buffers marked GGML_BACKEND_BUFFER_USAGE_WEIGHTS are preferentially assigned to whichever backend owns those weights.
ggml_backend_t backends[2] = { gpu_backend, cpu_backend };
ggml_backend_sched_t sched = ggml_backend_sched_new(
    backends,
    NULL,                      // use default buffer types
    2,                         // number of backends
    GGML_DEFAULT_GRAPH_SIZE,   // max nodes in graph
    false,                     // parallel splits
    true                       // op offload
);
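The priority rule can be pictured as a plain first-match loop over the backend array; this is a deliberate simplification (the real scheduler also weighs tensor locality and split costs), with the supports callback standing in for ggml_backend_supports_op:

```cpp
#include <functional>

// Pick the backend for a graph node: the first (lowest-index) backend in the
// array that supports the node's op wins. Returns -1 if none support it.
// `supports` stands in for ggml_backend_supports_op in this sketch.
static int pick_backend(int n_backends,
                        const std::function<bool(int backend, int op)> & supports,
                        int op) {
    for (int i = 0; i < n_backends; i++) {
        if (supports(i, op)) {
            return i;
        }
    }
    return -1;
}
```

With { gpu_backend, cpu_backend }, this is why ops land on the GPU whenever it supports them and fall back to the CPU otherwise.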
The scheduler API follows a straightforward lifecycle:
1. Reserve (optional). Pass a representative max-size graph to pre-allocate buffers. This avoids allocation at compute time.

struct ggml_cgraph * measure_graph = build_graph(sched, MAX_BATCH);
ggml_backend_sched_reserve(sched, measure_graph);

2. Reset. Clear allocations from the previous graph before computing a new one.

ggml_backend_sched_reset(sched);

3. Allocate. Explicitly allocate the graph; if you skip this step, allocation happens automatically on the first compute.

ggml_backend_sched_alloc_graph(sched, graph);

4. Set inputs. Copy data into the allocated input tensors.

ggml_backend_tensor_set(input_tensor, host_data, 0, nbytes);

5. Compute. Execute the graph. Returns a ggml_status value.

ggml_backend_sched_graph_compute(sched, graph);

6. Read outputs. Copy results back to host memory.

ggml_backend_tensor_get(result, out_data, 0, nbytes);

Complete example

The following is drawn directly from examples/simple/simple-backend.cpp and shows the full lifecycle — backend selection, graph construction, scheduling, and result retrieval.
#include "ggml.h"
#include "ggml-backend.h"

#include <cstdint>
#include <vector>

struct simple_model {
    struct ggml_tensor * a {};
    struct ggml_tensor * b {};
    ggml_backend_t backend {};
    ggml_backend_t cpu_backend {};
    ggml_backend_sched_t sched {};
    std::vector<uint8_t> buf;
};

void init_model(simple_model & model) {
    ggml_backend_load_all();

    // Pick the best available GPU, fall back to CPU
    model.backend = ggml_backend_init_best();
    model.cpu_backend = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, nullptr);

    ggml_backend_t backends[2] = { model.backend, model.cpu_backend };
    model.sched = ggml_backend_sched_new(backends, nullptr, 2,
                                         GGML_DEFAULT_GRAPH_SIZE, false, true);
}

struct ggml_cgraph * build_graph(simple_model & model) {
    size_t buf_size = ggml_tensor_overhead() * GGML_DEFAULT_GRAPH_SIZE
                    + ggml_graph_overhead();
    model.buf.resize(buf_size);

    struct ggml_init_params params = {
        .mem_size   = buf_size,
        .mem_buffer = model.buf.data(),
        .no_alloc   = true,
    };
    struct ggml_context * ctx = ggml_init(params);
    struct ggml_cgraph  * gf  = ggml_new_graph(ctx);

    model.a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 4);
    model.b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 3);

    struct ggml_tensor * result = ggml_mul_mat(ctx, model.a, model.b);
    ggml_build_forward_expand(gf, result);
    ggml_free(ctx);
    return gf;
}

// Input data for the 2x4 and 2x3 tensors created in build_graph
// (illustrative values; any floats will do for this demo)
static const float matrix_A[4 * 2] = { 2, 8, 5, 1, 4, 2, 8, 6 };
static const float matrix_B[3 * 2] = { 10, 5, 9, 9, 5, 4 };

struct ggml_tensor * compute(simple_model & model, struct ggml_cgraph * gf) {
    ggml_backend_sched_reset(model.sched);
    ggml_backend_sched_alloc_graph(model.sched, gf);

    ggml_backend_tensor_set(model.a, matrix_A, 0, ggml_nbytes(model.a));
    ggml_backend_tensor_set(model.b, matrix_B, 0, ggml_nbytes(model.b));

    ggml_backend_sched_graph_compute(model.sched, gf);
    return ggml_graph_node(gf, -1);
}

int main(void) {
    simple_model model;
    init_model(model);

    struct ggml_cgraph * gf = build_graph(model);
    struct ggml_tensor * result = compute(model, gf);

    std::vector<float> out(ggml_nelements(result));
    ggml_backend_tensor_get(result, out.data(), 0, ggml_nbytes(result));

    ggml_backend_sched_free(model.sched);
    ggml_backend_free(model.backend);
    ggml_backend_free(model.cpu_backend);
}
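For checking the example's output shape and values on the host: ggml_mul_mat(ctx, a, b) produces a tensor of shape [a->ne[1], b->ne[1]] by dotting rows of a with rows of b, which must share the inner dimension ne[0]. A reference sketch of that semantic (so the 2x4 a and 2x3 b above yield a 4x3 result):

```cpp
#include <vector>

// Reference for ggml_mul_mat semantics: `a` holds m rows of length k,
// `b` holds n rows of length k, both row-major; the m x n result is
// out[j * m + i] = dot(row i of a, row j of b).
static std::vector<float> mul_mat_ref(const std::vector<float> & a,
                                      const std::vector<float> & b,
                                      int k, int m, int n) {
    std::vector<float> out(m * n, 0.0f);
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            for (int c = 0; c < k; c++) {
                out[j * m + i] += a[i * k + c] * b[j * k + c];
            }
        }
    }
    return out;
}
```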

Available backends

Backend    Platforms                  Hardware                     Build flag
CPU        All                        x86, ARM, RISC-V, PowerPC    Always available
CUDA       Linux, Windows             NVIDIA GPUs                  -DGGML_CUDA=ON
Metal      macOS 13+                  Apple Silicon, AMD GPUs      -DGGML_METAL=ON
Vulkan     Linux, Windows, Android    Cross-vendor GPUs            -DGGML_VULKAN=ON
OpenCL     Linux, Windows, Android    AMD, Intel, Qualcomm         -DGGML_OPENCL=ON
SYCL       Linux                      Intel GPUs, oneAPI           -DGGML_SYCL=ON
RPC        All                        Remote devices               -DGGML_RPC=ON

CPU backend

SIMD-optimised execution on x86 and ARM with configurable thread pools.

CUDA backend

NVIDIA GPU acceleration with multi-GPU and split-tensor support.

Metal backend

Native Apple GPU compute for macOS and Apple Silicon.

Vulkan backend

Cross-vendor GPU support for Linux, Windows, and Android.

RPC backend

Distribute computation to remote machines over the network.