@TheTuringPost: Why KV cache is one of the main reasons LLMs are fast? KV cache is what connects attention mechanism with generation st…
Summary
KV cache stores previously computed key and value vectors during autoregressive generation, allowing models to avoid recomputing the entire sequence at each step, significantly speeding up inference at the cost of increased memory usage.
Cached at: 05/25/26, 08:48 AM
Why is KV cache one of the main reasons LLMs are fast?
KV cache is what connects the attention mechanism with the generation stage of autoregressive models. These models generate text token by token, but each new token still attends to all previous ones.
→ To optimize the decode phase, models store previously computed key and value vectors in a KV cache.
→ During generation, they compute new Q/K/V states only for the latest token and attend over the cached past representations.
Without KV cache, the model would recompute keys and values for the entire sequence at every step (for example, generating token 501 would recompute tokens 1–500), which is very slow.
But the tradeoff of KV cache is memory, because it grows with sequence length, batch size, layers, and attention heads.
That’s why so much research today targets KV efficiency and memory optimization. For example:
- Upgrading the attention mechanism, since it influences how the KV cache is formed. Choose an attention variant such as CompactAttention, MHA, MLA, etc., based on your needs.
- Improving memory management. The system needs to identify what to store long-term or keep local, when to summarize, and when to trim.
You can learn more about KV cache + attention here: https://turingpost.com/p/your-ultimate-guide-to-attention-mechanism-qkv-and-kv-cache… And how they fit into the full LLM inference pipeline here: https://turingpost.com/p/llm-inference-from-tokens-to-answers…
AI 101: Your Ultimate Guide to Attention: Mechanism, QKV, and KV Cache
Source: https://www.turingpost.com/p/your-ultimate-guide-to-attention-mechanism-qkv-and-kv-cache
Quick answer: What is attention in AI?
Attention in AI is the mechanism that lets Transformer models decide which parts of an input sequence should influence each token’s representation. It compares queries, keys, and values to build contextual meaning, powers self-attention, and enables faster generation through KV cache reuse.
**TL;DR:** Attention in AI lets Transformer models dynamically decide which tokens matter most for understanding context. Using queries, keys, and values, attention computes relationships between tokens, enabling contextual representations, efficient long-context processing, and modern capabilities like reasoning, translation, retrieval, and autoregressive text generation.
What is attention in AI? This question may sound simple, but there are actually many aspects worth a deep discussion.
In the previous episodes, we explored what a token in AI is and followed tokens as they became embeddings: dense vectors that are also points in a learned space and coordinates with semantic structure. Embeddings are where meaning lives, but coordinates alone are not enough. The next natural step in our series is attention. So what is its role in the workflow?
- Tokenization breaks language into units.
- Embedding turns those units into learnable geometry.
- Attention makes that geometry interact.
Attention solves a harder task than the previous stages: for this token, in this sentence, at this exact moment, which other tokens are relevant enough to change its meaning? **The overall mechanism lets one token attend to another token, decide how relevant it is, and pull in exactly the information it needs.** At this moment a sequence stops being a row of vectors and becomes context.
This mechanism became an indispensable foundation of modern AI, most notably as the central computation in Transformers.
So let’s unpack where attention comes from, how it works, the core components and types of this mechanism, and why attention isn’t the same as understanding. It’s foundational knowledge.
In today’s episode:
- The history of attention: from translation to Transformers
- How attention works
  - From embeddings to contextual representations
  - Queries, keys, and values: the QKV mechanism
  - How to compute attention step-by-step?
- What is KV cache and why does it matter?
- How attention mechanisms are evolving
- Why attention is not the same as understanding
- Sources and further reading
The history of attention: from translation to Transformers
Before attention became the center of the Transformer described in one of the most influential papers in AI – *Attention Is All You Need* from Google, it had appeared three years earlier as a solution for a practical problem in neural machine translation.
Early encoder–decoder models encoded an entire source sentence into a single fixed-length vector – a dense numerical representation that captures the sentence’s meaning – and decoded the translation from that compressed summary. Dzmitry Bahdanau, KyungHyun Cho and Yoshua Bengio in their *Neural Machine Translation by Jointly Learning to Align and Translate* paper (2014) argued that this fixed-length context vector created a bottleneck, because the model had to compress all relevant information into one representation. They proposed a model that could align source and target words during decoding, letting the decoder softly search over different parts of the input sentence while generating each new word. The decoder dynamically computes a context vector as a weighted combination of source annotations, focusing on the most relevant parts of the input for the current prediction. This was the moment when context became adaptive and target-dependent.
Then in 2015, Stanford researchers Minh-Thang Luong, Hieu Pham and Christopher D. Manning published *Effective Approaches to Attention-based Neural Machine Translation*, which extended this idea with practical attention mechanisms. They introduced global attention, where the decoder attends to all source positions, and local attention, where it focuses only on a smaller window of source words at each step. They also proposed an input-feeding approach, where the model feeds earlier attention information back into later steps so it can remember which parts of the source it has already focused on.
Image Credit: Global attention mechanism in Effective Approaches to Attention-based Neural Machine Translation
Image Credit: Local attention window in Effective Approaches to Attention-based Neural Machine Translation
Then finally came the main breakthrough – the architecture built around attention itself. Yes, it is about Transformers and the *Attention Is All You Need* paper (2017) by A. Vaswani et al. The Google researchers removed recurrence and convolutions and made self-attention layers the core foundation of Transformers. They also introduced the formulation of attention and the language of queries, keys, and values that became the canonical way to describe how models retrieve and combine contextual information.
That is the story of how attention became the centerpiece of modern models, but the history alone is not enough to understand how it works. Let’s go through every part of the workflow step-by-step.
How attention works
From embeddings to contextual representations
The usual Transformer model starts with token embeddings (dense numerical vector representations of tokens that encode semantic and syntactic information in a continuous vector space), combined with positional encodings, because the word order matters a lot in this architecture. At this stage, these vectors are still not deeply contextual. They have information about token identity, and with positional encoding they “know” something about location, but they do not yet define what matters to what.
So these “token embeddings + positional encodings” vectors serve as the inputs to the attention mechanism, entering the Transformer’s attention layers.
Image Credit: Transformer architecture showing self-attention layers and positional encodings, Attention is All You Need
There, each token representation is transformed into queries, keys, and values. These are linear projections of embeddings or hidden states – context-aware vector representations of tokens as they are processed layer by layer inside the model. Embeddings provide the raw material from which attention builds its comparisons. Without embeddings, the mechanism would have nothing meaningful to compare or route through the network.
Queries, keys, and values: the QKV mechanism
Here is the essential vocabulary that everyone needs to know to understand the full mechanism and attention formulation.
| Concept | Simple meaning | Why it matters |
| --- | --- | --- |
| Query | What the current token is looking for | Starts the relevance search |
| Key | What each token exposes about itself | Lets other tokens decide whether to attend to it |
| Value | The information passed forward if selected | Becomes the attention output |
| Self-attention | Tokens attending to other tokens in the same sequence | Builds contextual representations |
| Multi-head attention | Several attention operations running in parallel | Captures different relationships at once |
| KV cache | Stored keys and values from previous tokens | Speeds up generation and reduces recomputation |
After embeddings and positional information enter the model, each token vector is multiplied by learned weight matrices to produce three different versions of itself:
- **A query (Q)** is what the current token is looking for. It is the signal used to compare against other tokens and determine which ones may be relevant.
- **A key (K)** is information that every token exposes about itself so others can decide whether to attend to it. It acts like a tag or description attached to a token.
- **A value (V)** represents the information a token contributes if attention selects it. It is the actual content being passed along, like a payload. A weighted combination of value vectors becomes the attention output.
But why can’t we just use ordinary embeddings, and why do we need to split them into these three vectors?
The answer is simple: a token does not have just one job. A model may need one representation for searching, another for matching, and another for carrying information forward. One word can simultaneously be a requester, a candidate match, and a carrier of content. Without this separation for Q, K, and V, attention would collapse multiple roles into one vector and lose its flexibility.
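The three projections can be sketched in a few lines of NumPy. This is only an illustration: the sizes are arbitrary, the weight matrices are random rather than learned, and the names `X`, `W_q`, `W_k`, and `W_v` are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 8   # illustrative sizes, not taken from any real model

# Stand-in for "token embeddings + positional encodings": one row per token.
X = rng.normal(size=(seq_len, d_model))

# Projection matrices. In a trained model these are learned; here they are random.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # what each token is looking for
K = X @ W_k   # what each token exposes about itself
V = X @ W_v   # what each token contributes if selected

print(Q.shape, K.shape, V.shape)   # (4, 8) (4, 8) (4, 8)
```

The same input row produces three different vectors because each role has its own matrix, which is the separation of jobs described above.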
How to compute attention step-by-step?
Now it’s time to move on to the core aspect – computation. A classic formulation for attention looks like this:
Image Credit: Attention Is All You Need
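For reference, the scaled dot-product attention formula from *Attention Is All You Need* can be written as follows, where Q, K, and V are the matrices of queries, keys, and values, and dₖ is the dimension of the key vectors:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$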
Let’s figure out what is so important here.
- For each token, the model compares its query (Q) to the keys (K) of other tokens. This operation is computed for all tokens simultaneously using matrices Q, K, and V. The dot product between a Q and K measures compatibility. This produces attention scores.
- Those scores are divided by √dₖ, which keeps them, and the gradients that flow through softmax, in a workable range. This scaling is needed because when dimensions grow large, raw dot products can become too large in magnitude as well, which makes softmax overly sharp and training less stable.
- Then softmax – a mathematical function that converts a set of numbers into probabilities that sum to 1 – turns similarity scores into attention weights, determining how strongly each token should influence the result.
- Finally, the model uses those weights to compute a weighted sum of the value (V) vectors. The result is a new representation of the token that has pulled in the most relevant information from the sequence.
In decoder self-attention, the model also applies masking, preventing tokens from attending to future positions. This preserves the autoregressive behavior of text generation, where each token can depend only on earlier tokens.
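The four steps and the causal mask can be put together in a minimal NumPy sketch. It handles a single head on random inputs, and the function names are assumptions made for this example rather than any library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # steps 1-2: compare queries to keys, then scale
    if causal:
        # Assumes Q and K cover the same positions: token i may only see tokens j <= i.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)     # step 3: each row becomes weights that sum to 1
    return weights @ V, weights            # step 4: weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, weights = scaled_dot_product_attention(Q, K, V, causal=True)
print(out.shape)                 # (4, 8)
print(np.round(weights, 2))      # lower-triangular: no weight on future positions
```

With the mask on, the first token can only attend to itself, so its output is exactly its own value vector.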
**So this is the workflow in easy words: a token enters as a vector, asks a question, scans the sequence for relevant keys, gathers weighted values, and exits as a richer vector. The process repeats for every token and across every Transformer layer. We can also call attention a lookup mechanism that learns relevance in a high-dimensional space.** And this is the moment where embeddings become contextual representations refined layer by layer.
Interestingly, in the **Multi-Head Attention (MHA)** variant, this workflow is performed several times in parallel using different learned projections of Q, K, and V. Different attention heads (independent attention operations) can focus on different kinds of relationships and representation subspaces, like local dependencies, long-range dependencies, and structural patterns. The outputs of all heads are then concatenated and projected again into a single representation.
Image Credit: Self-attention and Multi-head attention processing different token relationships in parallel, Attention Is All You Need
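A compact way to see the "several times in parallel" part is to split the projected vectors into heads, run attention on every head at once, and merge the results. The sketch below assumes random weights, no masking, and a hypothetical `multi_head_attention` helper written for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # one score matrix per head
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (num_heads, seq_len, d_head)
    # Concatenate the heads back to (seq_len, d_model), then mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4   # illustrative sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, weights.shape)   # (5, 16) (4, 5, 5)
```

Each of the four heads gets its own 5×5 attention pattern, while the output keeps the original model width.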
What is KV cache and why does it matter?
In Transformers, text generation works one token at a time, and when generating the next token, the model still needs access to all previous tokens. If it recomputed their keys and values at every step (for example, recomputing everything for tokens 1-500 from scratch to produce token 501), decoding would be very slow.
So there is a solution: systems store the previously computed key and value vectors, forming the KV cache. The model computes new query, key, and value vectors only for the current token, then attends over the cached representations from previous positions. Past queries aren’t needed, so they are not cached.
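The following sketch compares decoding with and without a cache for a single attention layer and a single head. The "tokens" are random vectors standing in for hidden states, and all names are assumptions made for this example. It checks that both paths give the same outputs and counts how many key/value projections each one performs.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attend(q, K, V):
    # One query vector against all keys and values seen so far.
    weights = softmax(K @ q / np.sqrt(q.shape[-1]))
    return weights @ V

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(6, d))   # stand-ins for the hidden states of 6 tokens

# With a cache: project only the newest token, append its K and V, reuse the rest.
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
cached_out, cached_projections = [], 0
for x in tokens:
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    cached_projections += 1
    cached_out.append(attend(q, K_cache, V_cache))

# Without a cache: re-project the whole prefix at every step.
naive_out, naive_projections = [], 0
for t in range(1, len(tokens) + 1):
    prefix = tokens[:t]
    K, V = prefix @ W_k, prefix @ W_v
    naive_projections += t
    naive_out.append(attend(prefix[-1] @ W_q, K, V))

print(np.allclose(cached_out, naive_out))      # True
print(cached_projections, naive_projections)   # 6 21
```

For n tokens the uncached path projects 1 + 2 + … + n = n(n+1)/2 token positions, while the cached path projects n. Growing the cache with `np.vstack` keeps the example short; a production system would more likely preallocate or page this memory.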
But in this case the main tradeoff becomes memory: KV cache grows with sequence length, batch size, number of layers, and in standard multi-head attention, with the number of attention heads, since each head maintains its own cached key and value states.
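A back-of-the-envelope estimate makes this growth concrete. The helper below simply multiplies the factors listed above. The model configuration is hypothetical, and the 2 bytes per value assume 16-bit storage.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    # Factor 2: one key vector and one value vector per token, layer, and head.
    return (2 * batch_size * seq_len * num_layers
            * num_kv_heads * head_dim * bytes_per_value)

# Hypothetical model: 32 layers, 32 KV heads of size 128, 16-bit values.
per_token = kv_cache_bytes(batch_size=1, seq_len=1, num_layers=32,
                           num_kv_heads=32, head_dim=128)
full = kv_cache_bytes(batch_size=8, seq_len=4096, num_layers=32,
                      num_kv_heads=32, head_dim=128)

print(per_token / 2**10, "KiB per token")   # 512.0 KiB per token
print(full / 2**30, "GiB for the batch")    # 16.0 GiB for the batch
```

Every factor enters linearly, so doubling the context length or the batch size doubles the cache.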
This is why so much attention research focuses on making KV storage more efficient. **Multi-Query Attention (MQA)** shares one key-value head across all query heads. **Grouped-Query Attention (GQA)** uses several KV groups, keeping much of MQA’s speed while staying closer to full multi-head quality. **Cross-Layer Attention (CLA)** shares KV activations across neighboring Transformer layers, reducing KV cache memory use by up to 2x compared to MQA and GQA. And **Multi-Head Latent Attention (MLA)** compresses KV states into lower-rank latent vectors. It was introduced with DeepSeek-V2, which reports a 93.3% smaller KV cache than the earlier DeepSeek 67B model.
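The difference between MHA, GQA, and MQA comes down to how many key-value heads are stored. The sketch below treats all three as one function in which each stored KV head serves a group of query heads. It uses random inputs, applies no masking, and `grouped_query_attention` is a name assumed for this example. CLA and MLA work differently and are not covered here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(Q, K, V):
    # Q: (num_q_heads, seq_len, d_head); K, V: (num_kv_heads, seq_len, d_head)
    group_size = Q.shape[0] // K.shape[0]
    # Each stored KV head serves `group_size` consecutive query heads.
    K = np.repeat(K, group_size, axis=0)
    V = np.repeat(V, group_size, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
num_q_heads, seq_len, d_head = 8, 5, 4   # illustrative sizes
Q = rng.normal(size=(num_q_heads, seq_len, d_head))

for name, num_kv_heads in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    K = rng.normal(size=(num_kv_heads, seq_len, d_head))
    V = rng.normal(size=(num_kv_heads, seq_len, d_head))
    out = grouped_query_attention(Q, K, V)
    cached_values = K.size + V.size   # numbers a KV cache would have to hold
    print(name, out.shape, cached_values)
# MHA (8, 5, 4) 320
# GQA (8, 5, 4) 80
# MQA (8, 5, 4) 40
```

The output shape is the same in all three cases. Only the amount of stored key-value data changes, shrinking in proportion to the number of KV heads.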
How attention mechanisms are evolving
If you want to explore the core and most popular attention mechanisms, including the ones we have already mentioned here and FlashAttention, an algorithm optimized for GPU hardware that makes use of fast on-chip memory, we recommend checking out this list of research papers.
And here, we’ll focus on some of the most interesting recent attention variants instead.
- **Elastic core-periphery attention** is an alternative building block for Vision Transformers working at high resolution. The model routes communication through a small set of learned “core” tokens: patch tokens interact only with the cores, while the cores communicate with each other. This reduces attention complexity from quadratic to roughly linear in image size.
- **Mixture of Indexer Sparse Attention (MISA)** is a sparse attention mechanism that makes attention over very long contexts faster and cheaper. It dynamically selects only a small subset of the most useful heads for each query. A lightweight router decides which heads to activate based on coarse block-level statistics, and only those heads perform the expensive token-level search.
- **Block-Filtered Long-Context Attention (BFLA)** follows a very similar idea to MISA but sparsifies along the token/block axis. It first groups tokens into coarse blocks and estimates their importance, and then computes full token-level attention only inside the selected regions.
- **Latent-Condensed Attention (LCA)** compresses context inside the latent space used by Multi-Head Latent Attention (MLA). It merges groups of similar semantic vectors into compact representations while separately preserving positional information through anchor tokens. This lets the model reduce attention computation and KV cache size at the same time.
- **Low-Rank Key-Value Attention (LRKV)** is another interesting mechanism. It reduces KV cache memory by making attention heads share one main key-value representation while adding small low-rank head-specific corrections on top.
So the core directions in attention research are moving toward more aggressive KV cache reduction and toward mechanisms adapted specifically for very long contexts. If you look deeper, this trend makes a lot of sense: model reasoning chains are now much longer, agents keep much more context in memory, context windows keep growing, but at the same time we still want the speed and efficiency of much lighter models. And probably, we are still waiting for a new breakthrough in this field.
Why attention is not the same as understanding
Before we finish, let’s clarify one more important point. Attention is often described using human metaphors, but AI attention is about mechanics and compute. It does not “understand” information the way humans do, relying on memory, awareness, and intent. Attention in AI just computes which parts of the input should influence the output more strongly.
If a model sees the sentence “The big red ball was under the table” and is asked “Where was the red ball?”, attention helps it focus on the relevant parts – “red ball” and “under the table.” But if asked “Where was the blue balloon?”, the model may or may not detect the inconsistency depending on its training and internal representations, while a human immediately notices the mismatch.
That is why attention is not the same as understanding. Attention helps the model stay grounded in context, but it doesn’t guarantee correctness. It is just a part of the stack that makes models more relevant. Of course, without attention, models would never have reached this level of performance. However, attention is still ultimately a mechanism – a tool for prioritizing input that helps models stay on track.
Sources and further reading
- Attention Is All You Need
- Neural Machine Translation by Jointly Learning to Align and Translate
- Effective Approaches to Attention-based Neural Machine Translation
- Fast Transformer Decoding: One Write-Head is All You Need
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- Elastic Attention Cores for Scalable Vision Transformers
- MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
- BFLA: Block-Filtered Long-Context Attention Mechanism
- Latent-Condensed Transformer for Efficient Long Context Modeling
- Low-Rank Key Value Attention
- DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
FAQ
What is attention in AI?
Attention is a mechanism used in Transformers that allows models to determine which tokens in a sequence are most relevant when processing information. It dynamically computes relationships between tokens to build contextual understanding.
How do queries, keys, and values work in attention?
Queries represent what a token is searching for, keys describe what each token offers, and values contain the information passed forward. Attention compares queries with keys to decide how much of each value to use.
Why is attention important for Transformers?
Attention allows Transformers to process relationships between tokens in parallel instead of sequentially. This enables better context modeling, scalability, and long-range dependency handling compared to older recurrent architectures.
What is KV cache in LLMs?
KV cache stores previously computed key and value vectors during autoregressive generation so the model does not recompute them for every new token. This dramatically speeds up inference in large language models.
Attention vs understanding: what is the difference?
Attention helps models prioritize relevant information in context, but it does not guarantee reasoning, factual correctness, or human-like understanding. It is a computational mechanism, not consciousness or intent.
