The `examples/simple` directory contains two minimal programs that each multiply two matrices using ggml. They demonstrate the two main approaches to memory and compute management:
- **simple-ctx**: Context-based allocation. All tensors and the compute graph live in a single `ggml_context`. Simple to use; CPU-only.
- **simple-backend**: Backend-based allocation. Separates graph definition from execution. Supports CPU, CUDA, Metal, and other backends.
Both programs compute A × Bᵀ for two matrices and print the result.
## Context-based approach (simple-ctx.cpp)
This is the legacy API. Memory for tensors and the compute graph is allocated inside a single ggml_context using a fixed-size memory pool.
### Calculate and allocate the memory pool
Before creating any tensors you must calculate the total memory needed and pass it to `ggml_init`. Setting `no_alloc = false` means tensor data buffers are allocated immediately inside the memory pool.
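A sketch of the pool setup. `ggml_tensor_overhead`, `ggml_graph_overhead`, and `ggml_init` are from `ggml.h`; the byte counts are illustrative, not the exact budget used by simple-ctx.cpp:

```cpp
#include "ggml.h"

static struct ggml_context * make_ctx(void) {
    // Budget: metadata for three tensors, their f32 data, and the graph.
    // Sizes are illustrative; compute them from your actual shapes.
    size_t ctx_size = 3 * ggml_tensor_overhead()
                    + (4*2 + 3*2 + 4*3) * sizeof(float)
                    + ggml_graph_overhead();

    struct ggml_init_params params = {
        /*.mem_size   =*/ ctx_size,
        /*.mem_buffer =*/ NULL,   // let ggml malloc the pool itself
        /*.no_alloc   =*/ false,  // tensor data lives inside the pool
    };
    return ggml_init(params);
}
```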
### Create tensors and copy data

Allocate 2D tensors and copy the input matrices into them.
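A sketch of this step, assuming a context `ctx` from the previous step and row-major host arrays `a_data` and `b_data` (the 4x2 and 3x2 shapes are illustrative):

```cpp
#include <string.h>
#include "ggml.h"

// A "rows x cols" matrix is created as (cols, rows): ne[0] is the
// contiguous dimension, so a row-major host array can be memcpy'd in.
struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 4);  // 4 rows x 2 cols
struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 3);  // 3 rows x 2 cols

// With no_alloc = false the data pointers are valid immediately.
memcpy(a->data, a_data, ggml_nbytes(a));
memcpy(b->data, b_data, ggml_nbytes(b));
```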
Note the argument order for `ggml_new_tensor_2d`: dimensions are (cols, rows) because ggml uses column-major storage.
### Build the compute graph

Describe the computation by connecting tensors with operations. `ggml_mul_mat(a, b)` computes A × Bᵀ. `ggml_build_forward_expand` walks the tensor dependency tree and records all nodes needed to produce `result`.

### Full source
## Backend-based approach (simple-backend.cpp)
The backend API separates graph definition from execution and works with any ggml backend — CPU, CUDA, Metal, and others. The key difference is that tensor data is allocated by the backend scheduler after the graph is built, not inside a context.
### Initialize backends
Load all available backends and create a scheduler that picks the best device. The scheduler runs each graph node on the highest-priority backend that supports the operation, falling back to CPU for unsupported ops.
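A sketch of this step using the `ggml-backend.h` device and scheduler API. `ggml_backend_load_all`, `ggml_backend_dev_count`/`_get`/`_init`, and `ggml_backend_sched_new` are real entry points, but the exact `ggml_backend_sched_new` signature has changed across ggml versions, so treat the argument list as an assumption:

```cpp
#include <vector>
#include "ggml-backend.h"

// Dynamically load whatever backends are available (CUDA, Metal, ...).
ggml_backend_load_all();

// Devices are enumerated best-first; initialize one backend per device.
std::vector<ggml_backend_t> backends;
for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
    backends.push_back(ggml_backend_dev_init(ggml_backend_dev_get(i), NULL));
}

// The scheduler assigns each graph node to the first backend that
// supports its op; the CPU backend at the end of the list is the fallback.
ggml_backend_sched_t sched =
    ggml_backend_sched_new(backends.data(), NULL, (int)backends.size(),
                           GGML_DEFAULT_GRAPH_SIZE, false);
```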
### Build the compute graph with no_alloc = true
Create a temporary context only to define the graph structure. Set `no_alloc = true` because the scheduler will allocate tensor data later.
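A sketch of the metadata-only context; the 4x2 and 3x2 tensor shapes are illustrative:

```cpp
#include "ggml.h"

// Only tensor/graph metadata lives here; no data buffers are reserved.
struct ggml_init_params params = {
    /*.mem_size   =*/ ggml_tensor_overhead() * GGML_DEFAULT_GRAPH_SIZE
                      + ggml_graph_overhead(),
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,   // the scheduler allocates tensor data later
};
struct ggml_context * ctx = ggml_init(params);

struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 4);
struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 3);
struct ggml_tensor * result = ggml_mul_mat(ctx, a, b);  // A x B^T

struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, result);
```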
### Allocate and upload tensor data

Let the scheduler allocate backend memory, then upload the input data.
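A sketch of this step, assuming `sched`, `gf`, `a`, `b`, and `result` from the previous steps and row-major host arrays `a_data`/`b_data`:

```cpp
#include <vector>
#include "ggml-backend.h"

// The scheduler assigns nodes to backends and allocates their buffers.
ggml_backend_sched_alloc_graph(sched, gf);

// Tensor data may live on a GPU, so use the backend copy helpers
// instead of a plain memcpy.
ggml_backend_tensor_set(a, a_data, 0, ggml_nbytes(a));
ggml_backend_tensor_set(b, b_data, 0, ggml_nbytes(b));

// Run the graph, then read the result back to host memory.
ggml_backend_sched_graph_compute(sched, gf);
std::vector<float> out(ggml_nelements(result));
ggml_backend_tensor_get(result, out.data(), 0, ggml_nbytes(result));
```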
### Full source
## Choosing an approach
| | Context-based (simple-ctx) | Backend-based (simple-backend) |
|---|---|---|
| Device support | CPU only | CPU, CUDA, Metal, Vulkan, … |
| Memory management | Single pre-allocated pool | Scheduler allocates per-backend |
| Data transfer | Direct memcpy | ggml_backend_tensor_set / _get |
| Complexity | Lower | Higher |
| When to use | Prototyping, CPU-only tools | Production, GPU acceleration |
