# Don't let the LLM speak, just probe it.
Source: [https://blog.j11y.io/2026-06-10_hidden-state-probes/](https://blog.j11y.io/2026-06-10_hidden-state-probes/)
TL;DR: When an LLM reads "here's some text, here's a criterion — does it satisfy it?", the answer often already exists in its hidden state before it generates a single token\. So skip generation entirely: grab the hidden state at the last prompt token \(~70% of the way up the model's layers\), feed it to a tiny MLP, calibrate the output\. Because the training data varies the criterion, you get one frozen model that acts as any classifier you can write in English\.
---
**The problem**: As part of my work at [NOPE](https://nope.net/) I need to ask lots of questions about lots of text\. Not "what topic is this" questions \(embedding classifiers with vanilla cosine distances handle those fine\) but structural ones\. Given a transcript, I want to know: *Is the speaker themselves the one struggling, or are they describing someone else? Is this sarcasm? Does "I used to hate this, but now I love it" express current dislike?* Embeddings are mostly blind to that sort of thing; they see hate\-words and love\-words and a topic\. The usual escalation is an LLM judge: send the text to a big model with a rubric, get prose back, parse it\. Judges work, but they're slow, they're pricey if you're running them on everything, and the confidence they report is vibes: a judge's "7/10" isn't a probability of anything\.
The thing I eventually internalized is that when an LLM reads a prompt like this:
```
<content_to_be_judged>
I used to hate this product, but honestly now I love it.
</content_to_be_judged>
<judgment_criteria>
The writer currently likes the product.
</judgment_criteria>
Criteria met?
```
…it has already decided the answer *before it generates anything* \(setting aside chain\-of\-thought, but allow me this small grace\)\. The comparison between criterion and content has been done, inside the forward pass, and the result is sitting there as geometry in the residual stream\. Generation, the slow, expensive, parse\-the\-prose part, is just the model translating a decision it has already made into words\.
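A minimal sketch of assembling that prompt, in Python\. The helper name `build_prompt` and its `seed` parameter are illustrative and not part of any library; the tags and the closing cue are copied from the template above\.

```python
# Hypothetical helper: the names here are illustrative, only the tags come from the template.
TEMPLATE = (
    "<content_to_be_judged>\n"
    "{content}\n"
    "</content_to_be_judged>\n"
    "<judgment_criteria>\n"
    "{criterion}\n"
    "</judgment_criteria>\n"
    "{seed}"
)


def build_prompt(content: str, criterion: str, seed: str = "Criteria met?") -> str:
    """Fill the fixed template so that the prompt always ends on the seed."""
    return TEMPLATE.format(
        content=content.strip(),
        criterion=criterion.strip(),
        seed=seed,
    )


prompt = build_prompt(
    "I used to hate this product, but honestly now I love it.",
    "The writer currently likes the product.",
)
```

Keeping the template in one place matters because the head is only trained on this exact layout; the hidden state at the final position is what gets read\.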
So: **don't let it speak**\. Take the hidden state at that final token, at some middle\-ish layer \(where rich representations of meaning tend to live\), and train a small [MLP](https://en.wikipedia.org/wiki/Multilayer_perceptron) \(or an even simpler linear probe\!\) on it that outputs one number\. That's the whole trick\.
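To make the "train a small head" step concrete, here is a sketch of the simpler variant, a linear \(logistic\) probe, in dependency\-free Python\. Everything in it is a stand\-in: the 8\-dimensional vectors are synthetic substitutes for real hidden states, and the function names are hypothetical\. A real head would be fitted on vectors captured from the model, most likely with a proper ML library\.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, x) -> float:
    """One number per hidden state: the probe's (uncalibrated) probability."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_linear_probe(xs, ys, epochs: int = 200, lr: float = 0.1):
    """Full-batch gradient descent on log loss."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = probe_score(w, b, x) - y  # d(log loss) / d(logit)
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic stand-ins for hidden states (assumption): Gaussian noise, shifted
# along one hidden direction according to the label.
rng = random.Random(0)
direction = [rng.gauss(0, 1) for _ in range(8)]


def fake_hidden_state(label: int):
    sign = 1.0 if label else -1.0
    return [rng.gauss(0, 1) + sign * d for d in direction]


labels = [i % 2 for i in range(200)]
states = [fake_hidden_state(y) for y in labels]

w, b = train_linear_probe(states[:150], labels[:150])
held_out = list(zip(states[150:], labels[150:]))
accuracy = sum((probe_score(w, b, x) > 0.5) == bool(y) for x, y in held_out) / len(held_out)
```

Swapping in an MLP changes the model inside `train_linear_probe` but not the shape of the pipeline: vectors in, one score out\.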
None of the ingredients are new\. Linear probes are old \([Alain & Bengio, 2016](https://arxiv.org/abs/1610.01644)\); the [logit lens](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens) and its tuned descendants \([Belrose et al\., 2023](https://arxiv.org/abs/2303.08112)\) read forward\-pass internals as in\-flight decisions\. What's a bit new is using the probe as a *general* zero\-shot classifier, i\.e\. one whose criterion is supplied in English at inference time\. To be fair, even *that* is a solved problem at BERT scale: see NLI cross\-encoders, e\.g\. [GLiClass](https://github.com/Knowledgator/GLiClass)\. But **crucially**, these will never reach the deeper understanding and causality/reasoning savvy of modern LLMs across 100k\+ tokens\. They just don't have the parameter count or context size\.
My recipe, if you want to do this yourself:
1. Take a small open model\. We use IBM's Granite 4\.0 micro; anything in the few\-billion\-parameter range works\. I strongly recommend training a LoRA to sharpen it\.
2. Fix a prompt template like the one above, ending in a seed token like `"Assessment:"`\. The seed isn't arbitrary: it's a prefix chosen to cue the judgment, so the decision takes shape in the hidden state at that position\.
3. Generate a training set of \(criterion, content, label\) triples, where the label says whether the criterion is satisfied: a few thousand, frontier\-model\-generated, covering lots of *different* criteria\. This is the important bit: because the criteria vary in training, the head learns to read "does the content satisfy the criterion," not any particular criterion\.
4. Run them through the model, collect the hidden state at the seed position, fit the MLP\.
5. Calibrate the outputs \(isotonic regression\) so that 0\.7 actually means seventy\-percent\-of\-these\-were\-positive\.
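Step 5 can be sketched with the pool\-adjacent\-violators algorithm, which is the standard way to fit an isotonic regression\. This is a dependency\-free illustration with made\-up scores; in practice a library implementation such as scikit\-learn's `IsotonicRegression` is the likelier choice, and the calibration set should be held out from the data used to fit the head\. The function names are hypothetical\.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Pool adjacent violators; returns (upper_bounds, means) of a step function."""
    # Pool tied scores first so equal inputs always map to the same output.
    grouped = {}
    for s, y in zip(scores, labels):
        total, count = grouped.get(s, (0.0, 0))
        grouped[s] = (total + y, count + 1)

    blocks = []  # each block: [sum_of_labels, count, highest_score_in_block]
    for s in sorted(grouped):
        total, count = grouped[s]
        blocks.append([total, count, s])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper

    uppers = [blk[2] for blk in blocks]
    means = [blk[0] / blk[1] for blk in blocks]
    return uppers, means


def calibrate(model, raw_score: float) -> float:
    """Map a raw head output to the observed positive rate of its block."""
    uppers, means = model
    idx = min(bisect_left(uppers, raw_score), len(means) - 1)  # clip above the top
    return means[idx]


# Made-up held-out scores and labels, purely for illustration.
raw = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
truth = [0, 0, 1, 0, 1, 0, 1, 1]
model = fit_isotonic(raw, truth)
calibrated = calibrate(model, 0.35)
```

The fitted function is a non\-decreasing staircase: every raw score in a block is replaced by the fraction of positives actually observed in that block, which is what makes the output readable as a probability\.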
What you end up with is strange and lovely: one frozen model and one tiny head that together form *any* classifier\. You write the criterion at request time, in English, and get back a calibrated probability in a few tens of milliseconds, for roughly embedding\-classifier money\. No per\-criterion training, ever\. A LoRA on the backbone sharpens it, but honestly the base model alone gets you most of the way: the capability is in the model already, and you're just reading it out instead of asking for it\. You're getting the *non*\-generative part of generative models\.
---
A note on that optional LoRA, because how it's trained is my favorite part of the recipe\. You don't train it to classify anything\. You train it to write\. For each training triple, a frontier model is handed the label and writes a one\-sentence verdict \(`ASSESSMENT: The content does not satisfy the criterion, because…`\), justifying a known answer rather than re\-judging\. The LoRA learns, with plain next\-token loss, to produce that text from the exact prompt the probe will see at inference\. And then at inference none of it is ever generated: we stop at the seed and read the hidden state\. The text is scaffolding: its only job is to reshape the geometry at the seed position so the decision is laid out more legibly for the MLP\. **You teach the model to say the answer so it crystallizes before it speaks \.\.\. then never let it speak\.**
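As a sketch of what one LoRA training example might look like: the prompt is the exact string the probe sees at inference \(ending on the seed\), and the completion is the frontier\-written verdict\. The `Triple` class, the `prompt`/`completion` keys and the sample verdict are all assumptions for illustration; the recipe itself only specifies plain next\-token loss on that text\.

```python
from dataclasses import dataclass

SEED = "ASSESSMENT:"


@dataclass
class Triple:
    criterion: str
    content: str
    label: bool   # ground truth, also used later to fit the probe head
    verdict: str  # one sentence written by a model that was told the label


def probe_prompt(t: Triple) -> str:
    """The prompt used both for LoRA training and for probing at inference."""
    return (
        f"<content_to_be_judged>\n{t.content}\n</content_to_be_judged>\n"
        f"<judgment_criteria>\n{t.criterion}\n</judgment_criteria>\n"
        f"{SEED}"
    )


def to_sft_example(t: Triple) -> dict:
    """Split at the seed: everything after it is text the LoRA learns to write."""
    return {"prompt": probe_prompt(t), "completion": " " + t.verdict}


# Illustrative data only; the verdict sentence is made up for this sketch.
triple = Triple(
    criterion="The writer currently likes the product.",
    content="I used to hate this product, but honestly now I love it.",
    label=True,
    verdict="The content satisfies the criterion, because the writer says they now love it.",
)
example = to_sft_example(triple)
```

Because `probe_prompt` is shared, the position the LoRA learns to continue from is the same position the head reads at inference, where nothing after the seed is ever generated\.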
---
A small aside on the seed token, because I find it quietly fascinating\. The last token of your prompt isn't just where the prompt stops; it's the *address where the answer gets written*\. Causal attention funnels everything the model has worked out into the position it's about to speak from; end the prompt with a judgment cue and the judgment crystallizes right there, one token wide\. Much of the industry already relies on this without saying so: the standard RLHF reward model is a scalar head on the final token of a templated prompt\. Same trick\.
The knobs to play with are the seed token itself \(e\.g\. `"Assessment: ___"` or `"Criteria met? ___"`\), its exact position relative to the criterion and content, and the precise layer\(s\) of extraction\. Often it is not the final layer that holds the most value, but somewhere after the middle\. Finding the right settings takes lots of experimentation, and they depend on the underlying model and problem domain\.
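A small sketch of the layer knob, assuming the hidden states arrive as one sequence per layer with the embedding output at index 0\. That is the layout Hugging Face `transformers` returns when a model is called with `output_hidden_states=True`, once the batch dimension has been indexed away\. The helper names and the 40\-layer toy model are hypothetical, and the 0\.7 default simply echoes the "~70% of the way up" rule of thumb\.

```python
def layer_index(num_layers: int, depth: float = 0.7) -> int:
    """Index into a (num_layers + 1)-long sequence whose entry 0 is the embeddings."""
    if not 0.0 < depth <= 1.0:
        raise ValueError("depth must be in (0, 1]")
    return max(1, round(depth * num_layers))


def seed_vector(hidden_states, depth: float = 0.7):
    """Return the vector at the last position (the seed) of the chosen layer.

    Assumes each layer is indexed as layer[position]; with batched tensors,
    select the batch element first.
    """
    num_layers = len(hidden_states) - 1
    layer = hidden_states[layer_index(num_layers, depth)]
    return layer[-1]


# Toy stand-in for a 40-layer model reading a 5-token prompt: every "vector"
# just records which layer and position it came from.
toy_states = [
    [[float(layer), float(pos)] for pos in range(5)]
    for layer in range(41)
]
vector = seed_vector(toy_states)
```

Sweeping `depth` over a handful of values and comparing held\-out accuracy of the resulting heads is one straightforward way to run the experimentation described above\.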
## The KV caching trick
The expensive part of a forward pass is reading the content\. So if you want to score one piece of content against twenty criteria \(and once you have this hammer, you do\), there's an optimization begging to be made: prefill the content once, cache the KV, then run each criterion as a cheap thirty\-token continuation against the cache\. I guess this can be called KV\-popping? It works, and the scores are bit\-identical to doing it the long way\.
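A back\-of\-envelope sketch of why the cache pays off, counting only tokens pushed through the model\. The 2,000\-token content length is an assumed figure; the twenty criteria and thirty\-token continuations come from the paragraph above\. Token counts ignore the cost of attending over the cache, so treat the ratio as a rough estimate, not a measured speedup\.

```python
def tokens_processed(content_tokens: int, criterion_tokens: int, num_criteria: int):
    """Compare total tokens run through the model with and without a shared prefill."""
    # Long way: every criterion re-reads the full content.
    naive = num_criteria * (content_tokens + criterion_tokens)
    # Cached: read the content once, then one short continuation per criterion.
    cached = content_tokens + num_criteria * criterion_tokens
    return naive, cached


naive, cached = tokens_processed(content_tokens=2000, criterion_tokens=30, num_criteria=20)
ratio = naive / cached
print(f"naive={naive} cached={cached} ratio={ratio:.1f}x")
```

The saving grows with both content length and the number of criteria, which is why the trick becomes attractive once you start asking many questions of the same text\.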
Except, of course, the cached content gets encoded *before the model has seen any criterion*\. If we instead take the slower route of putting the criterion *before* the content, every content token attends to the criterion at every layer\. The model reads the content already knowing the question \(very beneficial for some sorts of questions\)\. In the cached version, however, it reads blind, and the criterion only gets one look back at the end\.
On most content this costs nothing: on our real\-world eval sets the two paths are statistically identical\. But point it at harder, longer\-form content where the criterion is not straightforward to assess \(e\.g\. it involves counterfactuals or complex phrasing\), and the problem needs criterion and content to interact at every layer, which is exactly what the cache forecloses\.
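The trade\-off shows up directly in how much of the prompt is shared across criteria, since only a shared prefix can be cached\. A small sketch with hypothetical helper names, comparing the two orderings at the character level \(a real cache works on tokens, but the picture is the same\):

```python
import os


def content_first(content: str, criterion: str) -> str:
    return (
        f"<content_to_be_judged>\n{content}\n</content_to_be_judged>\n"
        f"<judgment_criteria>\n{criterion}\n</judgment_criteria>\nCriteria met?"
    )


def criterion_first(content: str, criterion: str) -> str:
    return (
        f"<judgment_criteria>\n{criterion}\n</judgment_criteria>\n"
        f"<content_to_be_judged>\n{content}\n</content_to_be_judged>\nCriteria met?"
    )


content = "I used to hate this product, but honestly now I love it."
criteria = [
    "The writer currently likes the product.",
    "Sarcasm is present in the text.",
]


def shared_prefix(build) -> str:
    # os.path.commonprefix compares character by character, so it works on any strings.
    return os.path.commonprefix([build(content, c) for c in criteria])


prefix_content_first = shared_prefix(content_first)      # includes the whole content block
prefix_criterion_first = shared_prefix(criterion_first)  # stops at the first criterion
```

With the content first, the reusable prefix covers the expensive part of the prompt; with the criterion first, almost nothing is shared, which is the price of letting every content token see the question\.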
EDIT: Apparently this is a typical cross\-encoder versus late\-interaction trade\-off \([ColBERT](https://arxiv.org/abs/2004.12832)\)\.
## What's this all for?
This technique now powers a thing called [Predicate](https://labs.nope.net/predicate) that I run inside [NOPE's](https://nope.net/) safety stack, where "run a structural question against every message of every conversation" is the whole job and judge economics simply don't work, both on cost and latency\. But the approach is general and the parts are commodity: a small open model, a prompt template, a few thousand generated triples, an MLP, isotonic regression\. An afternoon of plumbing, honestly\. If you build one, I'd love to hear where it breaks for you\.
---
By [James](https://j11y.io/)\.
---
Thanks for reading\! :\]