@MaximeRivest: current llm architecture is stupid (if not stupid its, at least, wasteful). take these 3 prompts of 4 context chunks: […

X AI KOLs Following 06/09/26, 04:25 PM News

Summary

A tweet criticizes current LLM architecture for wasteful recomputation due to order-dependent context, and proposes encoding context units separately to enable order-invariant, efficient caching and generation.

Original Article

Cached at: 06/10/26, 03:53 PM

Current LLM architecture is stupid (if not stupid, it's at least wasteful).

Take these 3 prompts, each built from the same 4 context chunks:

[file A][file B][task][tool specs]
[tool specs][file B][file A][task]
[task][file B][file A][tool specs]

Their order should be irrelevant: inside these context fields order is crucial, BUT NOT outside.

This breaks the cache and forces us to recompute a lot of things when a single file in a code base changes, or when only the generation task changes (not the files or the tools), etc.
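To make the cache problem concrete, here is a minimal sketch assuming a prefix-style cache that can only reuse the leading run of unchanged chunks. The chunk sizes and the `reusable_tokens` helper are made up for illustration, and real caches work on tokens or token blocks rather than whole chunks.

```python
from itertools import takewhile

# Hypothetical chunk sizes in tokens; the numbers are invented for illustration.
CHUNK_TOKENS = {"file A": 4000, "file B": 3000, "task": 200, "tool specs": 1500}


def reusable_tokens(cached_prompt, new_prompt, sizes):
    """Tokens a prefix cache could reuse: only the leading run of identical chunks."""
    shared = takewhile(lambda pair: pair[0] == pair[1], zip(cached_prompt, new_prompt))
    return sum(sizes[old] for old, _ in shared)


p1 = ["file A", "file B", "task", "tool specs"]
p2 = ["tool specs", "file B", "file A", "task"]
p3 = ["task", "file B", "file A", "tool specs"]

# Same four chunks, different order: nothing is shared at the front.
print(reusable_tokens(p1, p2, CHUNK_TOKENS))  # 0
print(reusable_tokens(p1, p3, CHUNK_TOKENS))  # 0

# Only the task changes: the two files are reused, the tool specs after it are not.
p1_new_task = ["file A", "file B", "task v2", "tool specs"]
print(reusable_tokens(p1, p1_new_task, CHUNK_TOKENS))  # 7000

# The first file changes: everything behind it has to be recomputed.
p1_edited_a = ["file A v2", "file B", "task", "tool specs"]
print(reusable_tokens(p1, p1_edited_a, CHUNK_TOKENS))  # 0
```

Under this simplified model, out of 8700 cached tokens a pure reordering reuses none, even though every chunk is byte-for-byte identical.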

Also, it's harder to cheaply trim context to the essentials, and harder to retrieve context chunks and show only the relevant ones in a way that saves compute.

It should be possible to encode these context chunks so that something like this works:

u1 = encode(Unit(name="file_a.py", content=...))
u2 = encode(Unit(name="file_b.py", content=...))
u3 = encode(Unit(name="tool_specs.yaml", content=...))

model.generate(
    task="provide diff for fixing file_a.py file_b.py is irrelevant",
    ctx_units=[u1, u2, u3]
)
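One way to picture the caching side of this idea is the sketch below. It is not an existing API: `UnitEncoder` is a hypothetical stand-in for an expensive per-unit encoder, the file contents are invented, and the "encoding" is just a placeholder tuple where a real system would store something like per-unit KV state. The only point is that a cache keyed by unit content ignores the order in which units are passed.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    content: str


class UnitEncoder:
    """Hypothetical per-unit encoder that counts how often it really runs."""

    def __init__(self):
        self._cache = {}
        self.encode_calls = 0

    def encode(self, unit):
        # Key the cache on the unit itself, never on its position in a prompt.
        key = hashlib.sha256(f"{unit.name}\0{unit.content}".encode()).hexdigest()
        if key not in self._cache:
            self.encode_calls += 1
            # Placeholder for the real (expensive) encoded state.
            self._cache[key] = ("encoded", key)
        return self._cache[key]


enc = UnitEncoder()
file_a = Unit("file_a.py", "print('a')")       # invented contents
file_b = Unit("file_b.py", "print('b')")
tools = Unit("tool_specs.yaml", "tools: []")

first = [enc.encode(u) for u in (file_a, file_b, tools)]   # three real encodes
second = [enc.encode(u) for u in (tools, file_b, file_a)]  # reordered: all cache hits
calls_after_reorder = enc.encode_calls

edited_a = Unit("file_a.py", "print('a2')")
third = [enc.encode(u) for u in (edited_a, file_b, tools)]  # only file_a.py is re-encoded
print(calls_after_reorder, enc.encode_calls)  # 3 4
```

Whether a model can actually generate well from independently encoded units is the open question the tweet asks; the sketch only covers the bookkeeping.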

In this case u2 (file_b) would have been previously encoded, and its impact on the FLOPs for the task should quickly fizzle out as the early layers of the neural net figure out that it's irrelevant to the task.

And, while u1 and u3 are both relevant, their order is not.
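The order-invariance part has a simple mathematical core: softmax attention with no positional information across units gives the same output however the key/value pairs are arranged. The sketch below shows this for a single query and a single attention step in pure Python. The vectors are made up, `generate_step` is a hypothetical name, and it assumes any within-unit ordering was already baked into each unit's keys and values when that unit was encoded.

```python
import math


def attend(query, keys, values):
    """Single-query softmax attention with no positional information."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def generate_step(query, units):
    """Attend over the union of all units' (key, value) pairs."""
    pairs = [kv for unit in units for kv in unit]
    keys = [k for k, _ in pairs]
    values = [v for _, v in pairs]
    return attend(query, keys, values)


# Each unit is a list of (key, value) pairs; all numbers are invented.
u1 = [([1.0, 0.0], [1.0, 2.0]), ([0.5, 0.5], [0.0, 1.0])]
u2 = [([-1.0, -1.0], [9.0, 9.0])]  # scores low against q, so it gets little weight
u3 = [([0.0, 1.0], [3.0, -1.0])]

q = [2.0, 1.0]
out_a = generate_step(q, [u1, u2, u3])
out_b = generate_step(q, [u3, u1, u2])

# Equal up to floating-point rounding, whatever the unit order.
same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(out_a, out_b))
print(same)  # True
```

This only shows that unit order can be made irrelevant in principle. It says nothing about whether a model trained this way would match the quality of one that sees a single ordered sequence.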

Has anybody trained something like that?

It feels like rich late interaction for generation.
