The CPU backend is ggml’s built-in execution target. It requires no external dependencies, works on every supported platform, and is always available as a fallback when no GPU backend is present.
## Initialization

```c
#include "ggml-cpu.h"

ggml_backend_t backend = ggml_backend_cpu_init();
if (!backend) {
    fprintf(stderr, "failed to initialize CPU backend\n");
    return 1;
}
```
You can also use the generic backend selectors, which fall back to the CPU backend when no GPU is found:

```c
// Returns the best GPU backend, or the CPU backend if none is available
ggml_backend_t backend = ggml_backend_init_best();

// Always returns the CPU backend
ggml_backend_t cpu = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, NULL);
```

Call `ggml_backend_load_all()` before using `ggml_backend_init_best()` or `ggml_backend_init_by_type()` so that all compiled-in backends are registered.
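Putting these calls together, a minimal selection routine might look like the following sketch (it assumes `ggml-backend.h` is on the include path and the ggml library is linked):

```c
#include <stdio.h>
#include "ggml-backend.h"

int main(void) {
    // Register all compiled-in backends before querying them
    ggml_backend_load_all();

    // Picks a GPU backend if one is present, otherwise the CPU backend
    ggml_backend_t backend = ggml_backend_init_best();
    if (!backend) {
        fprintf(stderr, "no usable backend\n");
        return 1;
    }

    printf("using backend: %s\n", ggml_backend_name(backend));

    ggml_backend_free(backend);
    return 0;
}
```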
## Thread configuration

The CPU backend parallelises operations across threads. Set the thread count after initialization:

```c
// Set the number of threads used for graph compute
ggml_backend_cpu_set_n_threads(backend, 8);
```
### Custom thread pool

For finer control, including thread affinity and NUMA awareness, create a `ggml_threadpool` and attach it:

```c
#include "ggml-cpu.h"

struct ggml_threadpool_params tp_params = ggml_threadpool_params_default(8);
struct ggml_threadpool * pool = ggml_threadpool_new(&tp_params);

ggml_backend_cpu_set_threadpool(backend, pool);

// When done (once the backend no longer uses the pool):
ggml_threadpool_free(pool);
```
Thread pool management functions:

| Function | Description |
|---|---|
| `ggml_threadpool_new(params)` | Create a thread pool with the given parameters |
| `ggml_threadpool_free(pool)` | Destroy the thread pool |
| `ggml_threadpool_get_n_threads(pool)` | Query the thread count |
| `ggml_threadpool_pause(pool)` | Suspend worker threads |
| `ggml_threadpool_resume(pool)` | Resume suspended threads |
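One use of pause/resume is to keep worker threads idle between compute calls instead of spin-waiting. A sketch, assuming `backend`, `pool`, and `graph` were set up as shown above (`run_with_paused_pool` is a hypothetical helper, not a ggml API):

```c
#include "ggml.h"
#include "ggml-cpu.h"

// Sketch: wake the workers only for the duration of a graph compute.
void run_with_paused_pool(ggml_backend_t backend,
                          struct ggml_threadpool * pool,
                          struct ggml_cgraph * graph) {
    ggml_threadpool_resume(pool);               // wake worker threads
    ggml_backend_graph_compute(backend, graph); // run the graph
    ggml_threadpool_pause(pool);                // idle the workers again
}
```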
## NUMA support

On systems with multiple NUMA nodes, initialise ggml's NUMA support before creating backends:

```c
// Choose a strategy appropriate for your system
ggml_numa_init(GGML_NUMA_STRATEGY_DISTRIBUTE);
```
| Strategy | Description |
|---|---|
| `GGML_NUMA_STRATEGY_DISABLED` | No NUMA awareness (default) |
| `GGML_NUMA_STRATEGY_DISTRIBUTE` | Distribute threads across nodes |
| `GGML_NUMA_STRATEGY_ISOLATE` | Pin all threads to one node |
| `GGML_NUMA_STRATEGY_NUMACTL` | Honour the numactl binding inherited from the shell |
| `GGML_NUMA_STRATEGY_MIRROR` | Mirror allocation across nodes |
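The ordering matters: the strategy must be chosen before any backend exists. A sketch of the intended call sequence:

```c
#include <stdio.h>
#include "ggml-cpu.h"

int main(void) {
    // NUMA strategy must be set before any backend is created
    ggml_numa_init(GGML_NUMA_STRATEGY_DISTRIBUTE);

    ggml_backend_t backend = ggml_backend_cpu_init();
    if (!backend) {
        fprintf(stderr, "failed to initialize CPU backend\n");
        return 1;
    }

    // ... build and compute graphs ...

    ggml_backend_free(backend);
    return 0;
}
```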
## SIMD optimisations

ggml detects CPU features at runtime and selects the most capable implementation for each operation. You can query which extensions are available:

```c
// Each function returns 1 if the CPU supports the extension, 0 otherwise

// x86
ggml_cpu_has_avx()          // AVX
ggml_cpu_has_avx2()         // AVX2
ggml_cpu_has_avx512()       // AVX-512F
ggml_cpu_has_avx512_vnni()  // AVX-512 VNNI
ggml_cpu_has_avx512_bf16()  // AVX-512 BF16
ggml_cpu_has_avx_vnni()     // AVX-VNNI
ggml_cpu_has_fma()          // FMA3
ggml_cpu_has_f16c()         // F16C (half-precision conversions)
ggml_cpu_has_amx_int8()     // Intel AMX INT8
ggml_cpu_has_bmi2()         // BMI2

// ARM
ggml_cpu_has_neon()         // NEON SIMD
ggml_cpu_has_arm_fma()      // ARM FMA
ggml_cpu_has_dotprod()      // SDOT/UDOT dot-product
ggml_cpu_has_matmul_int8()  // SMMLA/UMMLA int8 matmul
ggml_cpu_has_sve()          // Scalable Vector Extension
ggml_cpu_get_sve_cnt()      // SVE vector length in bytes
ggml_cpu_has_sme()          // Scalable Matrix Extension
ggml_cpu_has_fp16_va()      // FP16 vector arithmetic

// Other architectures
ggml_cpu_has_riscv_v()      // RISC-V Vector Extension
ggml_cpu_get_rvv_vlen()     // RVV vector length in bytes
ggml_cpu_has_vsx()          // PowerPC VSX
ggml_cpu_has_vxe()          // IBM z Vector Extensions
ggml_cpu_has_wasm_simd()    // WebAssembly SIMD
```
You do not need to call these functions to get SIMD acceleration — ggml selects the best path automatically. Use them only if you need to log or assert specific capabilities.
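For example, a diagnostic routine might log a few detected features at startup. This is a sketch; the values printed naturally depend on the host CPU:

```c
#include <stdio.h>
#include "ggml-cpu.h"

// Sketch: print a handful of capability flags for diagnostics
void log_cpu_features(void) {
    printf("AVX2:    %d\n", ggml_cpu_has_avx2());
    printf("AVX-512: %d\n", ggml_cpu_has_avx512());
    printf("FMA:     %d\n", ggml_cpu_has_fma());
    printf("NEON:    %d\n", ggml_cpu_has_neon());
    printf("SVE:     %d (vector length %d bytes)\n",
           ggml_cpu_has_sve(), ggml_cpu_get_sve_cnt());
}
```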
## Abort callback

You can register a callback that the CPU backend invokes periodically during graph compute. Return `true` from the callback to abort execution:

```c
// `data` is the user pointer passed at registration time
bool my_abort(void * data) {
    bool * should_cancel = data;
    return *should_cancel; // return true to stop computation
}

bool should_cancel = false;
ggml_backend_cpu_set_abort_callback(backend, my_abort, &should_cancel);
```
## Reference implementations

For debugging or correctness testing, force the backend to use unoptimised scalar code:

```c
ggml_backend_cpu_set_use_ref(backend, true);
```
## Build configuration

The CPU backend is compiled into ggml unconditionally; no additional CMake flags are required. SIMD paths are enabled automatically when the target compiler supports them.

```shell
cmake -B build
cmake --build build
```

To target a specific architecture on x86:

```cmake
# Enable AVX2 and FMA explicitly
target_compile_options(ggml PRIVATE -mavx2 -mfma)
```
## API summary

| Function | Description |
|---|---|
| `ggml_backend_cpu_init()` | Create a CPU backend instance |
| `ggml_backend_is_cpu(backend)` | Check whether a backend is the CPU backend |
| `ggml_backend_cpu_set_n_threads(backend, n)` | Set the thread count |
| `ggml_backend_cpu_set_threadpool(backend, pool)` | Attach a custom thread pool |
| `ggml_backend_cpu_set_abort_callback(backend, cb, data)` | Register an abort callback |
| `ggml_backend_cpu_set_use_ref(backend, use_ref)` | Force reference (scalar) implementations |
| `ggml_backend_cpu_reg()` | Return the CPU backend registry entry |
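Putting the pieces together, a minimal end-to-end sketch of the backend lifecycle (graph construction is elided, since it is covered elsewhere; `never_abort` is a hypothetical callback for illustration):

```c
#include <stdbool.h>
#include "ggml.h"
#include "ggml-cpu.h"

// Hypothetical callback that never requests cancellation
static bool never_abort(void * data) {
    (void) data;
    return false;
}

int main(void) {
    ggml_backend_t backend = ggml_backend_cpu_init();
    if (!backend) return 1;

    ggml_backend_cpu_set_n_threads(backend, 4);
    ggml_backend_cpu_set_abort_callback(backend, never_abort, NULL);

    // ... build a graph, then ggml_backend_graph_compute(backend, graph) ...

    ggml_backend_free(backend);
    return 0;
}
```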