LLMs are not the black box you were promised

Hacker News Top Papers

Summary

An article summarizing Anthropic's 2025 paper on mechanistic interpretability, showing that LLMs are not black boxes and that circuit tracing can reveal multi-step reasoning and human-identifiable concepts.


Cached at: 06/03/26, 12:35 AM

# LLMs are not the Black Box you were promised

Source: [https://www.jay.ai/blog/llms-are-not-a-black-box](https://www.jay.ai/blog/llms-are-not-a-black-box)

![Overview figure from Anthropic's 'On the Biology of a Large Language Model,' showing circuit-tracing case studies across multi-step reasoning, planning, multilingual circuits, addition, medical diagnoses, hallucinations, refusals, jailbreaks, and more.](https://www.jay.ai/blog/images/llms-are-not-a-black-box/header.png)

On the Biology of a Large Language Model (Anthropic, 2025)

LLMs are not the "black box" you were promised. Mechanistic interpretability (peering into a neural network to reverse engineer its inner workings) has made major strides. Anthropic's [*On the Biology of a Large Language Model*](https://transformer-circuits.pub/2025/attribution-graphs/biology.html) (2025) is a landmark in that effort. What follows is a summary of their progress and some related thoughts.

## What is an LLM actually "thinking"?

How can we understand what an LLM is "thinking"? It's clearly very valuable to do so: it could enable steering model behavior, detecting dangerous intent, and more. But it's much harder than simply observing individual neuron activations, because of **superposition**: a single neuron participates in many unrelated concepts, and any given concept is smeared across many neurons. You can't just read meaning off one unit. You need to get creative.

![Explanation of why interpreting LLMs directly is hard due to polysemantic neurons, motivating a more interpretable replacement-model architecture.](https://www.jay.ai/blog/images/llms-are-not-a-black-box/superposition-replacement-model.png)

## Circuit tracing

One approach: train a second model to identify discrete concepts, then monitor how those concepts interact over the course of a forward pass. Anthropic's **circuit tracing** technique trains a "replacement" model to sparsely recreate the outputs of the base model's MLP layers.
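To make superposition and the appeal of a sparse feature basis concrete, here is a toy sketch. It is not Anthropic's replacement model: nothing is trained, the feature directions are hand-built, and every name in it (`CONCEPTS`, `DIRECTIONS`, `neuron_activations`, `sparse_features`) is hypothetical. It only assumes that three concepts are packed into two "neurons" as directions 120 degrees apart, which is enough to show why one neuron alone is unreadable while a larger set of sparse features is not.

```python
import math

# Hypothetical toy: three concepts squeezed into two "neurons" (superposition).
CONCEPTS = ["Texas", "the Olympics", "Austin"]

# Assumption: each concept is a unit direction in 2-D neuron space,
# spaced 120 degrees apart. Real models learn their own directions.
DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(len(CONCEPTS))
]


def neuron_activations(active):
    """Sum the directions of the active concepts into a 2-neuron activation."""
    x = sum(DIRECTIONS[CONCEPTS.index(c)][0] for c in active)
    y = sum(DIRECTIONS[CONCEPTS.index(c)][1] for c in active)
    return (x, y)


def sparse_features(activation):
    """Project onto each concept direction, then zero out negative scores (ReLU).

    This hand-built readout stands in for a learned sparse dictionary:
    it has more features (3) than neurons (2), and most features stay at zero.
    """
    features = {}
    for name, (dx, dy) in zip(CONCEPTS, DIRECTIONS):
        score = activation[0] * dx + activation[1] * dy
        features[name] = max(0.0, score)
    return features


if __name__ == "__main__":
    for concept in CONCEPTS:
        act = neuron_activations([concept])
        # Neuron 0 is nonzero for every concept, so it is polysemantic.
        print(concept, "-> neurons", act, "-> features", sparse_features(act))
```

With a single active concept, the raw neuron 0 moves for all three concepts, yet the sparse readout shows one feature near 1 and the other two at exactly 0. That one-feature-per-concept view is the property the replacement model is after, here obtained by construction rather than by training.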
The replacement model effectively decomposes the base model's activations into a set of sparse features, and it turns out these features correspond to high-level concepts that humans can readily identify, like "Texas" or "the Olympics."

![Diagram comparing the original transformer model with a sparse replacement model whose features map to human-interpretable concepts.](https://www.jay.ai/blog/images/llms-are-not-a-black-box/circuit-tracing-replacement.png)

Once you have these human-interpretable features, you can group them into causally-linked clusters by tracing how they interact during the forward pass, building up a wiring diagram of the computation.

![Simplified circuit graph for 'the capital of the state containing Dallas,' plus an intervention experiment suppressing the Texas feature to change the output.](https://www.jay.ai/blog/images/llms-are-not-a-black-box/capital-circuit.png)

## Models really do reason in multiple steps

When you run this in practice, you can watch models engage in genuine multi-step reasoning via intermediary concepts. The model will even "think ahead" to future rhyme candidates when planning a poem. Ask it *"what is the capital of the state containing Dallas"* and you can observe, in order:

- the **Dallas** feature goes active,
- which causes the **Texas** feature to light up,
- which then causes **Austin** to light up.

It seems fairly clear that this is tracing semantic relationships between high-level concepts and, in doing so, performing a kind of pseudo-symbolic inference, similar to what some philosophers would describe as "higher reasoning."

![Examples of multi-step reasoning inside the model, including geographic inference and forward planning of rhymes when writing a poem.](https://www.jay.ai/blog/images/llms-are-not-a-black-box/multi-step-reasoning.png)

## This isn't unique to LLMs

This phenomenon doesn't only apply to language models. MCTS-based systems like AlphaZero also converge on concepts that humans recognize.
DeepMind (2022) showed that AlphaZero learned intermediary representations aligning with human chess concepts such as "in check" and "pinning a piece" entirely on its own, with no human chess knowledge supplied.

![DeepMind 2022 research showing AlphaZero learning human-recognizable chess concepts through self-play.](https://www.jay.ai/blog/images/llms-are-not-a-black-box/alphazero-concepts.png)

## Better understanding → better algorithms

Breaking down a model's implicit reasoning can help us design better learning algorithms. For example: Claude 3.5 Haiku learned an algorithm for small-integer addition that does *not* cleanly map to human mental math. It splits the problem into multiple parallel pathways (computing a rough magnitude alongside the precise ones digit) and recombines them, leaning on memorized "lookup table" features. The natural question follows: can we identify this, then "guide" the model toward a better algorithm?

![Explanation of how Claude 3.5 Haiku adds two-digit numbers like 36+59 using multiple parallel pathways and lookup-table features.](https://www.jay.ai/blog/images/llms-are-not-a-black-box/addition-pathways.png)

## The model has a "subconscious"

It's worth noting that the model itself does not necessarily have *metacognitive* insight into the underlying thinking process uncovered by circuit tracing. Ask it to explain how it added two numbers and it will narrate a tidy, human-style procedure, which is not the algorithm it actually ran. For better or worse, the model has some level of subconscious. And that's precisely what lets us peer in.

![A conversation in which the model gives a human-style explanation for adding 36 and 59 that differs from the algorithm circuit tracing reveals it actually used.](https://www.jay.ai/blog/images/llms-are-not-a-black-box/metacognition-arithmetic.png)

## Why this matters

Mechanistic interpretability is a fascinating, fast-developing line of work with major Ws on the scoreboard.
Contrary to what your ML professor may have told you a decade ago, in some ways this is now the *most* insight we've ever extracted from a model. And the implications are significant: for identifying model misbehavior, for steering, and even for designing better learning algorithms.

For the original thread, see the [post on X](https://x.com/mathemagic1an/status/2035850046735098065). For the full research, read [Anthropic's paper](https://transformer-circuits.pub/2025/attribution-graphs/biology.html).

Jay Hack
