@MaximeRivest: Current LLM architecture is stupid (or if not stupid, it's at least wasteful). Take these 3 prompts of 4 context chunks: […
Summary
A tweet criticizes current LLM architecture for wasteful recomputation due to order-dependent context, and proposes encoding context units separately to enable order-invariant, efficient caching and generation.
Cached at: 06/10/26, 03:53 PM
Current LLM architecture is stupid (or if not stupid, it's at least wasteful).
Take these 3 prompts, each built from the same 4 context chunks:
[file A][file B][task][tool specs]
[tool specs][file B][file A][task]
[task][file B][file A][tool specs]
The order of the chunks should be irrelevant. Inside each context field, order is crucial, BUT NOT outside.
This order dependence breaks caching and forces us to recompute a lot of things when a single file in a codebase changes, when only the generation task changes (not the files or the tools), etc.
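To make the cache problem concrete, here is a minimal sketch. It assumes a prefix-style cache, where work can be reused only for the leading run of chunks that two prompts share, and it treats each chunk as one opaque cache entry. The chunk labels come from the three prompts above; `reusable_prefix` and `prompt_4` are hypothetical names introduced for illustration, not any library's API.

```python
def reusable_prefix(cached, new):
    """Count the leading chunks shared by two prompts.

    Assumption: a prefix-style cache can reuse a chunk only if every
    chunk before it is identical and in the same position.
    """
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


prompt_1 = ["file A", "file B", "task", "tool specs"]
prompt_2 = ["tool specs", "file B", "file A", "task"]
prompt_3 = ["task", "file B", "file A", "tool specs"]

# Same four chunks in a different order: no shared prefix to reuse.
reuse_12 = reusable_prefix(prompt_1, prompt_2)
reuse_13 = reusable_prefix(prompt_1, prompt_3)

# Only the task changes, but the tool specs come after it,
# so under this assumption they are recomputed as well.
prompt_4 = ["file A", "file B", "new task", "tool specs"]
reuse_14 = reusable_prefix(prompt_1, prompt_4)

# An order-invariant, per-chunk cache could reuse every chunk seen before.
shared_chunks = set(prompt_1) & set(prompt_4)

print(reuse_12, reuse_13, reuse_14, len(shared_chunks))
```

Under these assumptions the reordered prompts share nothing with the first one, while a per-chunk cache would still recognize all unchanged chunks.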
Also, it's harder to cheaply trim context down to the essentials, and harder to retrieve context chunks and show only the relevant ones in a way that saves compute.
It should be possible to encode these context chunks so that this becomes possible:
u1 = encode(Unit(name="file_a.py", content=…))
u2 = encode(Unit(name="file_b.py", content=…))
u3 = encode(Unit(name="tool_specs.yaml", content=…))

model.generate(
    task="provide diff for fixing file_a.py file_b.py is irrelevant",
    ctx_units=[u1, u2, u3]
)
In this case u2 (file_b) would have been encoded previously, and its impact on the FLOPs for the task should quickly fizzle out as the early layers of the neural net figure out that it's irrelevant to the task.
And while u1 and u3 are both relevant, their order is not.
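The caching behavior this interface would allow can be sketched in plain Python. This is only a toy under stated assumptions: `UnitCache`, the placeholder `generate`, and the file contents are hypothetical, and the "encoding" is a stand-in tuple rather than anything a real model computes. It shows the bookkeeping only: each unit is encoded once, keyed by its name and content, and the unit order has no effect on the generation input.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    content: str


class UnitCache:
    """Caches one encoding per unit, keyed by name and content only."""

    def __init__(self):
        self._store = {}
        self.encode_calls = 0  # how often the expensive step actually runs

    def encode(self, unit):
        raw = f"{unit.name}\0{unit.content}".encode()
        key = hashlib.sha256(raw).hexdigest()
        if key not in self._store:
            self.encode_calls += 1
            # Placeholder for the expensive, position-free encoding of a unit.
            self._store[key] = ("encoded", key)
        return self._store[key]


def generate(task, ctx_units):
    # Placeholder: a frozenset makes unit order irrelevant by construction.
    return (task, frozenset(ctx_units))


cache = UnitCache()
a = Unit("file_a.py", "print('a')")        # made-up contents
b = Unit("file_b.py", "print('b')")
tools = Unit("tool_specs.yaml", "tools: []")

first = generate("fix file_a.py", [cache.encode(u) for u in (a, b, tools)])
second = generate("fix file_a.py", [cache.encode(u) for u in (tools, b, a)])
calls_after_reorder = cache.encode_calls

# Editing one file re-encodes only that file.
a_edited = Unit("file_a.py", "print('a2')")
third = generate("fix file_a.py", [cache.encode(u) for u in (a_edited, b, tools)])
calls_after_edit = cache.encode_calls
```

Reordering the units triggers no new encoding, and editing one file triggers exactly one. Whether a model can be trained to consume such independently encoded units well is the open question the tweet asks.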
Has anybody trained something like that?
It feels like rich late interaction for generation.
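For readers unfamiliar with the term, here is a minimal sketch of late interaction as used in retrieval, assuming the tweet means the ColBERT-style MaxSim score (the tweet does not name a specific method). Each side is encoded separately into token vectors, and they interact only at scoring time. The 2-d vectors are made up for illustration.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def maxsim(query_vecs, unit_vecs):
    """Late-interaction score: for each query token vector, take its best
    match among the unit's token vectors, then sum those maxima."""
    return sum(max(dot(q, d) for d in unit_vecs) for q in query_vecs)


# Toy token embeddings (hypothetical values).
task_vecs = [(1.0, 0.0), (0.0, 1.0)]
file_a_vecs = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
file_b_vecs = [(-1.0, 0.0), (0.0, -1.0)]

score_a = maxsim(task_vecs, file_a_vecs)
score_b = maxsim(task_vecs, file_b_vecs)

# Each unit is scored independently, so the order of units cannot matter.
ranked = sorted(
    [("file_a.py", score_a), ("file_b.py", score_b)],
    key=lambda pair: pair[1],
    reverse=True,
)
```

In retrieval this yields one relevance score per document. The tweet's "rich" version would presumably feed the separately encoded units into generation rather than reduce them to a single number.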