When you call ggml_add, ggml_mul_mat, or any other ggml operation, no arithmetic is performed. Instead, a new tensor node is allocated that records the operation and its inputs; actual computation runs only when you call a graph compute function.
This design means:
- The same graph can be executed repeatedly (e.g., for each inference batch) without re-allocation overhead.
- Backends (CPU, CUDA, Metal, …) receive the full graph and can optimize execution order, fuse kernels, and schedule memory.
The ggml_cgraph structure
A computation graph is represented by ggml_cgraph, which tracks:
- nodes — tensors that require computation (operation outputs)
- leafs — tensors with no inputs (parameters, constants)
- grads — gradient tensors, populated after ggml_build_backward_expand
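A simplified sketch of the structure (field names follow ggml.h, but the real definition has more fields and changes between ggml versions, so treat this as illustrative only):

```c
struct ggml_tensor;  // opaque here; defined in ggml.h

// Simplified sketch of ggml_cgraph -- not the authoritative layout.
struct ggml_cgraph {
    int n_nodes;                  // number of operation nodes
    int n_leafs;                  // number of leaf tensors
    struct ggml_tensor ** nodes;  // operation outputs, in evaluation order
    struct ggml_tensor ** grads;  // per-node gradients (unset until the backward pass is built)
    struct ggml_tensor ** leafs;  // inputs: parameters and constants
};
```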
Full workflow
Step 1 — Initialize a context
Step 2 — Create tensors and define operations
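For example, continuing with the context ctx from step 1, a sketch of an elementwise add:

```c
// Two 1-D tensors of 4 floats each; c records "add(a, b)" but holds no result yet.
struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
struct ggml_tensor * c = ggml_add(ctx, a, b);  // no arithmetic happens here
```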
Operations return new tensor nodes but perform no computation.
Step 3 — Build the forward graph
ggml_build_forward_expand walks the tensor graph upward from the output node and registers all reachable nodes into gf:
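Continuing the running example, with c as the output tensor from step 2:

```c
struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, c);  // registers c and everything it depends on
```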
Step 4 — Set input values
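Because the graph is only a description, inputs can be filled (and refilled) at any point before computing. A sketch, using the tensors from step 2:

```c
// ggml_set_f32 sets every element of a tensor to one value;
// memcpy into tensor->data is the usual way to load real data.
ggml_set_f32(a, 1.5f);
ggml_set_f32(b, 2.0f);
// or: memcpy(a->data, src, ggml_nbytes(a));
```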
Step 5 — Compute
Step 6 — Free
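Freeing the context releases every tensor and graph allocated in it:

```c
ggml_free(ctx);
```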
Matrix multiplication example
The following is adapted from examples/simple/simple-ctx.cpp:
Marking tensors as inputs and outputs
When using the backend allocator (ggml_gallocr), mark tensors explicitly so the allocator can make better decisions about memory layout: set GGML_TENSOR_FLAG_INPUT and GGML_TENSOR_FLAG_OUTPUT in tensor->flags.
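ggml.h provides helpers that set these flags for you; with the a, b, c tensors from the workflow above:

```c
ggml_set_input(a);   // sets GGML_TENSOR_FLAG_INPUT
ggml_set_input(b);
ggml_set_output(c);  // sets GGML_TENSOR_FLAG_OUTPUT, so the result buffer is not reused as scratch space
```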
Inspecting the graph
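For example, ggml_graph_print dumps every node, and you can also walk the node array directly (direct field access as shown here follows older ggml versions; newer ones expose accessor functions instead):

```c
ggml_graph_print(gf);  // prints each node's op, shape, and timing info

for (int i = 0; i < gf->n_nodes; i++) {
    struct ggml_tensor * t = gf->nodes[i];
    printf("node %d: op=%s, shape=[%lld, %lld]\n",
           i, ggml_op_name(t->op), (long long) t->ne[0], (long long) t->ne[1]);
}
```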
Compute functions reference
ggml_graph_compute_with_ctx
Convenience wrapper that allocates the work buffer inside the context. Requires that you have reserved enough space in the context for the work buffer.
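A sketch (the thread count is arbitrary; recent ggml versions return an enum ggml_status you can check):

```c
enum ggml_status st = ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);
if (st != GGML_STATUS_SUCCESS) {
    // handle the error
}
```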
ggml_graph_plan / ggml_graph_compute
Lower-level API that lets you supply your own work buffer.
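A sketch of the plan/compute pair; note that newer ggml versions add a threadpool parameter to ggml_graph_plan:

```c
#include <stdlib.h>

// Build a plan, allocate the work buffer yourself, then execute.
struct ggml_cplan plan = ggml_graph_plan(gf, /*n_threads=*/4);
if (plan.work_size > 0) {
    plan.work_data = (uint8_t *) malloc(plan.work_size);
}
ggml_graph_compute(gf, &plan);
free(plan.work_data);
```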
Backend API (ggml_backend_graph_compute)
When using a hardware backend, dispatch the graph through the backend API:
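A sketch using the CPU backend; CUDA and Metal backends follow the same pattern, and tensor buffers must be allocated through the backend (e.g. via ggml_gallocr) rather than in the context:

```c
#include "ggml-backend.h"

ggml_backend_t backend = ggml_backend_cpu_init();
// ... allocate tensor buffers for the graph with ggml_gallocr, set inputs ...
ggml_backend_graph_compute(backend, gf);
ggml_backend_free(backend);
```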
