`malloc`/`free` at inference time. Instead, all tensor metadata and graph structures are allocated from a single fixed-size buffer provided at context creation. Tensor data for GPU or accelerated backends lives in separate backend buffers managed by the backend layer.
## The arena allocator
Every `ggml_context` owns a contiguous memory arena. All `ggml_new_tensor_*` and `ggml_new_graph` calls bump-allocate from this arena. When you call `ggml_free`, the entire arena is released in one shot.
This model means:
- Zero per-tensor allocation overhead during graph execution.
- Predictable memory usage — you know the upper bound at startup.
- No fragmentation.
## `ggml_init_params`
| Field | Description |
|---|---|
| `mem_size` | Total bytes available to the context. Must fit all tensor metadata, graph structs, and (if `no_alloc = false`) tensor data. |
| `mem_buffer` | Optional externally owned buffer. Pass `NULL` to let ggml allocate. |
| `no_alloc` | When `true`, tensors are created with `data = NULL`. Use this when tensor data will be allocated by a backend buffer (`ggml_gallocr`). |
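A minimal CPU-only initialization might look like the following sketch. The 16 MiB figure is illustrative, not a recommendation:

```c
#include "ggml.h"

int main(void) {
    // CPU-only context whose arena also holds tensor data.
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,   // let ggml allocate and own the arena
        /*.no_alloc   =*/ false,  // tensor data is placed in the arena too
    };

    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    (void) a;  // a->data points into the arena because no_alloc = false

    ggml_free(ctx);  // the entire arena is released in one call
    return 0;
}
```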
## Computing required memory
To size the arena correctly, account for every tensor and for the graph struct itself. `ggml_tensor_overhead()` and `ggml_graph_overhead()` return the constant size of the respective structs (including internal alignment padding), so you can always compute an exact upper bound.
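For example, a metadata-only context can be sized exactly. This is a sketch; `context_size` is a helper name introduced here, not part of ggml:

```c
#include "ggml.h"

// Exact arena size for a metadata-only context (no_alloc = true)
// holding up to n_tensors tensors and one graph.
static size_t context_size(size_t n_tensors) {
    return n_tensors * ggml_tensor_overhead()  // per-tensor struct + padding
         + ggml_graph_overhead();              // ggml_cgraph struct + padding
}
```

The result can be passed directly as `mem_size` when the context is created with `no_alloc = true`.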
## Querying used memory
After building the graph, you can ask how much of the arena was actually consumed with `ggml_used_mem()`.

## `no_alloc` mode and backend buffers
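A sketch of querying consumption; `print_arena_usage` is a helper name introduced here, not part of ggml:

```c
#include <stdio.h>
#include "ggml.h"

// Report how much of the context arena has been bump-allocated so far.
static void print_arena_usage(const struct ggml_context * ctx, size_t mem_size) {
    size_t used = ggml_used_mem(ctx);  // bytes consumed from the arena
    fprintf(stderr, "arena: %zu / %zu bytes used\n", used, mem_size);
}
```

Comparing this against `mem_size` is a cheap way to verify a sizing formula before shrinking the arena.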
When using hardware backends (CUDA, Metal, Vulkan, …), tensor data must live in device memory, not in the CPU-side arena. The standard pattern is:

- Create the context with `no_alloc = true` so the arena only stores tensor metadata.
- Build the graph.
- Use `ggml_gallocr` or `ggml_backend_alloc_ctx_tensors` to allocate device memory for all tensors.
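Put together, the pattern might look like this on the CPU backend. A sketch; the same shape applies to device backends with their own buffer types:

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

void run(void) {
    // 1. metadata-only arena
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 64 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    // 2. build the graph (data pointers are still NULL here)
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, ggml_add(ctx, a, b));

    // 3. allocate backend memory for every tensor in the graph
    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_cpu_buffer_type());
    ggml_gallocr_alloc_graph(galloc, gf);

    // ... set inputs, compute, read outputs ...

    ggml_gallocr_free(galloc);
    ggml_free(ctx);
}
```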
## `ggml_gallocr` — graph-level allocation
`ggml_gallocr` allocates tensor data for an entire graph in a single backend buffer, using live-range analysis so that tensors which are never live at the same time can share memory.
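One common use is sizing the buffer once against a worst-case graph and then reusing the allocator every iteration. A sketch; `build_graph` is a hypothetical callback that builds each iteration's graph in a `no_alloc` context:

```c
#include "ggml.h"
#include "ggml-alloc.h"

void alloc_loop(ggml_gallocr_t galloc, struct ggml_cgraph * worst_case,
                struct ggml_cgraph * (*build_graph)(void)) {
    // Size the backend buffer once for the largest graph we will see.
    ggml_gallocr_reserve(galloc, worst_case);

    for (int i = 0; i < 16; i++) {
        struct ggml_cgraph * gf = build_graph();
        // Assigns offsets inside the reserved buffer; no new allocation.
        ggml_gallocr_alloc_graph(galloc, gf);
        // ... compute gf ...
    }
}
```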
## Multi-backend allocation
When using a backend scheduler that spans multiple devices, pass one buffer type per device.

## Tensor-level allocation
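A sketch using `ggml_gallocr_new_n`, with one buffer type per device in the same order as the scheduler's backends. The GPU buffer type is assumed to come from an already-initialized device backend:

```c
#include "ggml-alloc.h"
#include "ggml-backend.h"

void make_multi_allocator(ggml_backend_buffer_type_t gpu_buft) {
    ggml_backend_buffer_type_t bufts[2] = {
        gpu_buft,                        // device 0 (GPU)
        ggml_backend_cpu_buffer_type(),  // device 1 (CPU fallback)
    };
    ggml_gallocr_t galloc = ggml_gallocr_new_n(bufts, 2);

    // ... ggml_gallocr_alloc_graph(galloc, gf) as in the single-backend case ...

    ggml_gallocr_free(galloc);
}
```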
For finer-grained control, `ggml_tallocr` allocates individual tensors from a backend buffer.
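A sketch of placing a single tensor into an existing backend buffer; exact signatures have varied between ggml versions:

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

void place_tensor(ggml_backend_buffer_t buffer, struct ggml_context * ctx) {
    struct ggml_tallocr talloc = ggml_tallocr_new(buffer);

    // Tensor created in a no_alloc context: metadata only, data == NULL.
    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);

    ggml_tallocr_alloc(&talloc, w);  // w now points into `buffer`
}
```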
## Allocating all tensors in a context
If you created a context with `no_alloc = true` and want to allocate all of its tensors on a specific backend in one call, use `ggml_backend_alloc_ctx_tensors`.
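A sketch, assuming `ctx` is a `no_alloc` context and `backend` is already initialized:

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

void alloc_all(struct ggml_context * ctx, ggml_backend_t backend) {
    // Allocates one backend buffer sized for every tensor in ctx and
    // points each tensor's data into it.
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);
    (void) buf;

    // The returned buffer owns the tensor data; free it once the tensors
    // are no longer needed:
    // ggml_backend_buffer_free(buf);
}
```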
## Context lifecycle
`ggml_reset` lets you reuse the same memory region for a different graph without going through `ggml_init` again.
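A sketch of the reuse loop; `build_graph` is a hypothetical helper that constructs each iteration's graph inside the context:

```c
#include "ggml.h"

void run_steps(struct ggml_context * ctx, int n_steps,
               struct ggml_cgraph * (*build_graph)(struct ggml_context *)) {
    for (int step = 0; step < n_steps; step++) {
        ggml_reset(ctx);  // discard previous tensors/graph, keep the buffer
        struct ggml_cgraph * gf = build_graph(ctx);
        (void) gf;
        // ... allocate backend memory and compute gf ...
    }
}
```

This avoids both the `ggml_init`/`ggml_free` round-trip and any risk of the arena growing across iterations.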
## Memory layout summary

- **Context arena**: holds `ggml_tensor` structs, the `ggml_cgraph`, and (when `no_alloc = false`) tensor data. Sized with `ggml_tensor_overhead()` and `ggml_graph_overhead()`.
- **Backend buffer**: holds the actual tensor data in device memory (CPU heap, CUDA VRAM, Metal shared memory, …). Allocated via `ggml_gallocr` or `ggml_backend_alloc_*`.