Introducing RadixAttention to Trellis

Lobsters Hottest Tools

Summary

Trellis introduces RadixAttention to optimize LLM inference prefill phase by caching prefix tokens using a radix tree, reducing redundant computation in chat and agentic sessions.

<p><a href="https://lobste.rs/s/g5opue/introducing_radixattention_trellis">Comments</a></p>
Original Article
View Cached Full Text

Cached at: 06/03/26, 07:42 AM

# Introducing RadixAttention to Trellis Source: [https://trellis.unfoldml.com/blog/radix-attention-intro](https://trellis.unfoldml.com/blog/radix-attention-intro) We created Trellis to democratize LLM inference without making compromises on the data privacy of its users\. Towards that goal, we built a system that users can deploy on hardware they already own and operate, i\.e\. laptops, workstations and servers alike\. To meet users where they are, we must accommodate more or less performant hardware, and therefore take optimization opportunities whenever possible\. In this post, we will focus on how we optimized the*prefill*phase of LLM inference in Trellis\. We'll start by giving a brief background on the problem, discuss our implementation of a caching strategy that is relevant for chat\-based and agentic LLM sessions, and conclude by showing some benchmark results\. ## LLM prefill and the KV cache The compute nodes of a Trellis grid perform the numerical operations of a LLM transformer block \(attention, linear layers etc\.\)\. Before the numerics can evaluate a result \(ie\. sample the next token\), we must populate all the operands with coefficients; this is the*prefill phase*of LLM inference, and it is compute\-bound\. Concretely, to produce token \\\(t\+1\\\) , attention needs the keys and values \(the projected embeddings\) of every prior token \(tokens \\\(0 \.\. t\\\) \)\. The first key observation: in autoregressive generation, a sampled token is appended to the already generated ones, and the combined sequence is fed back into the model as input\. This fact, combined with the causal attention mask, means that the embeddings \(and therefore the K and V matrices\) at position \\\(i < j\\\) do not depend on those at later positions \\\(k \\geq j\\\) , and can therefore be*cached*, append\-only, while producing all later tokens\. ## Prefix caching with RadixAttention The second key observation follows from the actual use of LLMs, i\.e\. chat sessions where a system prompt precedes the user request: repeated LLM requests based on the same template must prefill the model with a common prefix string, which means we can precompute and cache all the token embeddings of the system prompt\. Not only that, but in principle we could cache all*prefixes*of a prompt, if we know these will be reused*across*requests\. The key insight of[RadixAttention](https://www.lmsys.org/blog/2024-01-17-sglang/)was to use a specialized data structure \(a[radix tree](https://en.wikipedia.org/wiki/Radix_tree)\) that minimizes storage in the common case when string prefixes share whole substrings \(e\.g\. given the strings "hello my name is Alice", "hello my name is Bob", a radix tree would only store 3 entries: "hello my name is ", "Alice", "Bob"\)\. ## RadixAttention in Trellis In a Trellis grid, there can be multiple nodes making concurrent inference requests, which in turn are served by multiple nodes, each holding one or more transformer blocks\. As such, the KV caching mechanism also serves as a concurrency mechanism, since unrelated requests may share a prefix \(e\.g if a team reuses the same system prompt\)\. Our implementation of the RadixAttention KV cache is**block\-paged**: K/V activations are stored in fixed\-size*blocks*of token embeddings, drawn from a shared*pool*, and each session holds an ordered list of block IDs rather than a contiguous array\. When two sessions share a prefix, the second session**acquires**the existing prefix blocks \(a refcount bump\) instead of**allocating**and recomputing fresh ones\. ## Benchmarks ## Setup and definitions All benchmarks below run**TinyLlama 1\.1B Q4\_0**in\-process on a single Mac M1, over synthetic prompts\. The recurring parameters are: - **L**— prompt length, in tokens\. - **D**—*decode budget*: the number of tokens generated per session\. - **N**— number of concurrent decode sessions\. We use one shared "system prompt" that is warmed once, plus N divergent user suffixes, one per session\. - **φ \(phi\)**— the*shared\-prefix fraction*: what fraction of each prompt is the common prefix\. φ=0\.5 means the first half of every prompt is identical across calls\. Note that φ is**block\-quantized**: internally the shared region is rounded to a whole number of 32\-token blocks \(`prefixBlocks = round\(φ·L/32\)`\), so the realized φ snaps to multiples of 32/L\. - **op**— one full benchmark*iteration*: building a fresh inference pipeline, warming the shared prefix, and running all N sessions to completion\. So "allocations/op" is the heap traffic of an entire iteration,**not**of a single token\. ## Throughput vs shared prefix fraction ![As the common prefix of two prompts gets larger in proportion, both generation throughput and memory allocations steadily improve.](https://trellis.unfoldml.com/static/blog/throughput-phi.png)As the common prefix of two prompts gets larger in proportion, both generation throughput and memory allocations steadily improve\.Prompt length L : 512 tokens, Decode budget D : 64 tokens/session How much do we gain in terms of generation throughput and memory allocation by using RadixAttention with prompts that share a common prefix? Turns out, quite a bit\. The**allocations**curve \(the red trace\) shows that caching works as designed : as \\\(\\phi\\\) rises, the number of heap allocations per iteration falls monotonically, since a larger share of the prefix blocks are*acquired*from the pool \(a refcount bump\) rather than*allocated*fresh, and the prefill positions that those blocks cover no longer have to be recomputed\. Reassuringly, the**throughput**curve \(in blue\) also improves linearly with increasing \\\(\\phi\\\) since cached KV positions need not be prefilled \(the fluctuations at \\\(\\phi\\\) 0\.75 and 0\.9 are likely due to a noisy benchmarking environment\)\. ## Time\-to\-first\-token ![The latency of subsequent request that share a common prefix also improves linearly as a function of the proportion of equal tokens.](https://trellis.unfoldml.com/static/blog/ttft-phi.png)The latency of subsequent request that share a common prefix also improves linearly as a function of the proportion of equal tokens\.Here we want to know how effective RadixAttention is at eliminating redundant compute when prompt prefixes are shared across LLM calls\. TTFT is measured by prewarming the cache with one prompt, then decoding one token for a single request \(concurrency = 1\)\. TTFT falls steadily as \\\(\\phi\\\) rises, because a larger shared prefix means more positions are served from cached blocks and fewer are prefilled from scratch\. Two notes on reading this figure\. First, the large variance at low values of \\\(\\phi\\\) is a first\-iteration artifact : the very first iteration pays a one\-time cold\-start cost\. Note that even at the right end of the grid the maximum achievable hit rate is`\(L/32 − 1\)/\(L/32\)`, not 1\.0; the probe prompt always diverges from the warm prompt in its final block, so one block is never shared\. ## Conclusion In this post we demonstrated how RadixAttention lets Trellis nodes run 30\-40% faster and use that much less memory; these advances are likely to compound as the trend of longer LLM sessions continues\. Trellis is still in its early days, and the design space of a distributed, concurrent LLM runtime is huge\. If you have thought about this kind of problems before and wish to share some feedback,[we would love to hear from you](https://trellis.unfoldml.com/contact)\! ## Appendix : Choosing a cache block size We choose the block size heuristically, to balance bookkeeping and latency\. The next figure helped us pick a default block size B = 32 \. ![Number of leaves of the radix tree cache and TTFT vs cache block size. The error bars account for varying target sequence lengths](https://trellis.unfoldml.com/static/blog/blocksize-sweep.png)Number of leaves of the radix tree cache and TTFT vs cache block size\. The error bars account for varying target sequence lengths

Similar Articles

Wall Attention (GitHub Repo)

TLDR AI

Wall Attention is a new attention variant with per-channel, per-timestep multiplicative decay, providing content-dependent forgetting rates and efficient training/decode kernels implemented in Triton.

LACE: Lattice Attention for Cross-thread Exploration

arXiv cs.AI

LACE introduces a lattice attention mechanism that enables concurrent reasoning paths in LLMs to share intermediate insights and correct errors during inference, improving reasoning accuracy by over 7 points compared to standard isolated parallel sampling.

Lighthouse Attention (11 minute read)

TLDR AI

Lighthouse Attention is a selection-based hierarchical attention mechanism that accelerates long-context pretraining by running forward+backward passes ~17× faster at 512K context and delivering 1.4–1.7× end-to-end speedup at 98K context, validated with Llama-3 530M on 50B tokens.

Interdomain Attention: Beyond Token-Level Key-Value Memory

arXiv cs.LG

Proposes Interdomain Attention, a new method that integrates state space models into attention via kernel methods, achieving efficient long-context modeling with a fixed-size state and outperforming SSMs and softmax attention in language modeling experiments up to 1.3B parameters.