@IntuitMachine: PEEK: The 1k-Token Map That Just Killed the Long-Context Tax Your LLM agent is reading the same 50k-token codebase for …
Summary
Microsoft introduces PEEK, a 1,024-token 'context map' that caches orientation knowledge for LLM agents, cutting redundant reasoning and achieving accuracy gains of up to 34%, 93–145 fewer retries, and up to 5.8× lower cost.
Cached at: 05/23/26, 04:13 PM
PEEK: The 1k-Token Map That Just Killed the Long-Context Tax
Your LLM agent is reading the same 50k-token codebase for the 20th time.
It still doesn’t know where anything is.
PEEK from @Microsoft just changed that with a 1k-token “context map” that:
• ↑ up to 34% accuracy
• 93–145 fewer retries
• up to 5.8× cheaper than prompt tuning
Here’s how:
Every time you ask GPT-5 a new question about the same repo, it re-discovers:
→ File structure
→ Key classes
→ How modules connect
You’re paying for the same orientation work. Again. And again.
Industry calls this “the long-context tax.”
PEEK’s breakthrough:
Separate “context understanding” from “task execution.”
Instead of stuffing everything into the prompt or retrieving blindly, agents now maintain a tiny persistent map — like a cheat-sheet they write once and reuse forever.
The Context Map has 5 sections:
• Context Roadmap: high-level structure
• Context Understanding: key entities/relationships
• Domain Constants (if needed)
• Parsing Schemas
• Reusable Results (cached answers)
Budget: exactly 1,024 tokens.
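To make the layout concrete, here is a minimal sketch of a map with the five sections and a 1,024-token budget check. The section names and the budget come from the thread; the `ContextMap` class, the `render` format, and the whitespace-based `count_tokens` are assumptions for illustration, not PEEK's actual implementation (a real system would count with the model's own tokenizer).

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Section names follow the thread; everything else below is hypothetical.
SECTIONS = (
    "Context Roadmap",
    "Context Understanding",
    "Domain Constants",
    "Parsing Schemas",
    "Reusable Results",
)
TOKEN_BUDGET = 1024


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


@dataclass
class ContextMap:
    sections: Dict[str, List[str]] = field(
        default_factory=lambda: {name: [] for name in SECTIONS}
    )

    def render(self) -> str:
        """Serialize non-empty sections into text that could be prepended to a prompt."""
        parts = []
        for name in SECTIONS:
            items = self.sections[name]
            if items:
                parts.append(name + ":")
                parts.extend("- " + item for item in items)
        return "\n".join(parts)

    def token_count(self) -> int:
        return count_tokens(self.render())

    def within_budget(self) -> bool:
        return self.token_count() <= TOKEN_BUDGET


cmap = ContextMap()
cmap.sections["Context Roadmap"].append("src/ holds core modules")
print(cmap.render())
```

Empty sections are skipped when rendering, which is one way to read "Domain Constants (if needed)": a section only costs tokens when it has something in it.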
Three modules keep it fresh without bloat:
• Distiller → Extracts only transferable orientation knowledge
• Cartographer → Makes clean, deduplicated edits (ADD/DELETE/REPLACE)
• Evictor → Drops low-priority items when budget fills
Separation matters: mixed roles = noise + duplication.
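The Cartographer and Evictor roles can be sketched as two small, separate functions. This is a rough illustration under stated assumptions: the `Item` and `Edit` types, the integer `priority` field, exact-match deduplication, and the whitespace token count are all hypothetical. The thread only names the three edit operations and the "drop low-priority items" behavior. The Distiller is left out because it would be an LLM call.

```python
from dataclasses import dataclass
from typing import List, Optional

TOKEN_BUDGET = 1024


@dataclass
class Item:
    section: str
    text: str
    priority: int  # hypothetical: higher means more worth keeping


@dataclass
class Edit:
    op: str                          # "ADD", "DELETE", or "REPLACE"
    section: str
    text: str                        # ADD: new text; DELETE/REPLACE: text of the target
    new_text: Optional[str] = None   # REPLACE only
    priority: int = 1


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer."""
    return len(text.split())


def _matches(item: Item, edit: Edit) -> bool:
    return item.section == edit.section and item.text == edit.text


def apply_edits(items: List[Item], edits: List[Edit]) -> List[Item]:
    """Cartographer role: apply edits in order, skipping exact duplicates on ADD."""
    result = list(items)
    for e in edits:
        if e.op == "ADD":
            if not any(_matches(i, e) for i in result):
                result.append(Item(e.section, e.text, e.priority))
        elif e.op == "DELETE":
            result = [i for i in result if not _matches(i, e)]
        elif e.op == "REPLACE":
            if e.new_text is None:
                raise ValueError("REPLACE needs new_text")
            result = [
                Item(i.section, e.new_text, e.priority) if _matches(i, e) else i
                for i in result
            ]
        else:
            raise ValueError(f"unknown op: {e.op}")
    return result


def evict(items: List[Item], budget: int = TOKEN_BUDGET) -> List[Item]:
    """Evictor role: drop the lowest-priority item until the map fits the budget."""
    kept = list(items)
    while kept and sum(count_tokens(i.text) for i in kept) > budget:
        kept.remove(min(kept, key=lambda i: i.priority))
    return kept
```

Keeping `apply_edits` and `evict` apart mirrors the separation the thread argues for: the editing step never has to reason about the budget, and the eviction step never has to reason about content.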
Tested on OOLONG + CL-bench (long-context reasoning and information aggregation benchmarks):
Metric | Gain vs. ACE (SOTA)
Accuracy | +6–34%
Iterations saved | 93–145 fewer
Cost reduction | 1.4–5.8× cheaper
Same base model. Same agent. Just 1k tokens of orientation cache.
Here’s the efficiency secret:
Freeze the map after 1–4 queries.
You get 80%+ of the gains but near-zero maintenance cost after that. Most “learning” systems never stop updating → wasted compute. PEEK learns fast, then locks in.
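The freeze rule is simple enough to sketch. The wrapper below runs an update callback for the first few queries and then leaves the map untouched. The `FreezingMap` class, the `update` callback signature, and the choice to update after each query is served are assumptions for illustration; the thread only states that the map is frozen after 1–4 queries.

```python
from typing import Callable


class FreezingMap:
    """Updates a context map for the first `freeze_after` queries, then locks it."""

    def __init__(self, update: Callable[[str, str], str], freeze_after: int = 4):
        self.update = update              # hypothetical: (current_map, query) -> new_map
        self.freeze_after = freeze_after
        self.queries_seen = 0
        self.context_map = ""

    @property
    def frozen(self) -> bool:
        return self.queries_seen >= self.freeze_after

    def handle(self, query: str) -> str:
        """Return the map used for this query; refresh it afterwards only while unfrozen."""
        used = self.context_map
        if not self.frozen:
            self.context_map = self.update(self.context_map, query)
        self.queries_seen += 1
        return used


# Toy update function: append the query's first letter to the map.
fm = FreezingMap(update=lambda current, query: current + query[0], freeze_after=2)
for q in ["alpha", "beta", "gamma"]:
    fm.handle(q)
print(fm.context_map, fm.frozen)
```

After the freeze point `handle` no longer calls `update`, so the per-query maintenance cost reduces to a counter increment.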
How PEEK beats the field:
• RAG: retrieves fragments, no holistic structure
• Summarization: compresses content, not orientation
• ACE/prompt tuning: optimizes tasks, not context understanding
• PEEK: caches the mental model your agent should have built on day 1
Devil’s advocate:
PEEK wins when context is structured and queries recur.
If you’re writing one-off creative fiction or chatting about random PDFs, the map has less to cache. But for repos, enterprise docs, analytics? This is the new baseline.
Traditional stack:
→ Bigger context windows
→ Better retrieval
→ Smarter prompts

New stack:
→ Bigger context windows
→ Better retrieval
→ Persistent orientation caches
Context understanding just became a first-class versioned artifact.
Two multipliers you can stack today:
• PEEK-style maps (↓ redundant reasoning)
• KV-cache optimizations (↓ redundant token processing)

Combine them = multiplicative inference savings.
The next wave of agent infra will bake both in by default.
If you’re building agents that interact with the same long contexts repeatedly:
→ Stop re-engineering prompts every query
→ Start caching orientation knowledge
The 1k-token map is the missing cache layer. Use it.
/end