MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models

arXiv cs.CL 05/29/26, 04:00 AM Papers
Summary
MechELK is a three-stage framework combining mechanistic interpretability tools (SAE, activation patching, causal probing) with representation engineering to elicit latent knowledge from LLMs, achieving 84.7% accuracy and outperforming existing methods like CCS and linear probing.
arXiv:2605.28825v1 Announce Type: new
# MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models
Source: [https://arxiv.org/html/2605.28825](https://arxiv.org/html/2605.28825)
Ji-jun Park, Soo-joon Choi, Jiwon Jeong, Taeyang Yoon, Ju-Wan Lee

Dongguk University

kwanlee14@dongguk.edu

###### Abstract

Large language models (LLMs) frequently encode factual and reasoning knowledge in their internal representations that is not faithfully reflected in their surface-level outputs, a phenomenon known as *latent knowledge*. Existing approaches to eliciting latent knowledge, such as Contrastive Consistency Search (CCS), rely on contrastive activation patterns and struggle with complex multi-step reasoning tasks, while mechanistic interpretability tools have primarily been used to *understand* model behavior rather than to *extract* hidden knowledge. We present **MechELK**, a unified three-stage framework that bridges mechanistic interpretability and latent knowledge elicitation. MechELK operates through: (1) **Locate**, using Sparse Autoencoder (SAE) feature analysis and activation patching to identify knowledge-bearing representations; (2) **Verify**, employing causal probing to distinguish genuine latent knowledge from spurious correlations; and (3) **Elicit**, applying representation engineering to surface hidden knowledge without modifying model weights. Evaluated on TruthfulQA, a curated Deceptive Alignment benchmark, and the Quirky LM dataset, MechELK achieves an average elicitation accuracy of 84.7%, outperforming CCS by 6.2% and direct linear probing by 9.1%. Crucially, MechELK successfully identifies latent knowledge in 78.3% of cases where the model’s surface output is incorrect or evasive, demonstrating its utility for AI safety applications including deceptive alignment detection.

## 1 Introduction

The alignment of large language models (LLMs) with human values depends not only on what these models *say*, but on what they *know* internally. A growing body of evidence suggests that LLMs routinely encode accurate factual and reasoning knowledge in their intermediate representations, yet fail, or refuse, to express this knowledge in their outputs (Kadavath et al., [2022](https://arxiv.org/html/2605.28825#bib.bib7); Lin et al., [2021](https://arxiv.org/html/2605.28825#bib.bib18); Greenblatt et al., [2024](https://arxiv.org/html/2605.28825#bib.bib19)). As these models are increasingly integrated into complex applications such as spoken task-oriented dialogue agents (Si et al., [2023](https://arxiv.org/html/2605.28825#bib.bib37)), omni-modal generation and understanding systems (Xin et al., [2025](https://arxiv.org/html/2605.28825#bib.bib33)), and multi-agent recursive frameworks (Zhang et al., [2025](https://arxiv.org/html/2605.28825#bib.bib35)), ensuring reliable alignment is more critical than ever. This gap between internal knowledge and external behavior poses a fundamental challenge for AI safety: if a model can “know” something without “saying” it, standard evaluation methods that rely on output inspection are insufficient to assess the model’s true capabilities or intentions.

The problem of *eliciting latent knowledge* (ELK) was formally introduced by Mallen et al. ([2023](https://arxiv.org/html/2605.28825#bib.bib12)), who proposed Contrastive Consistency Search (CCS) as a method for recovering hidden beliefs from model activations without relying on the model’s own outputs. While CCS represents a significant advance, it faces several limitations: it requires carefully constructed contrastive pairs, its performance degrades on complex multi-step reasoning and long-horizon tasks (Zhou et al., [2023](https://arxiv.org/html/2605.28825#bib.bib40); Si et al., [2025b](https://arxiv.org/html/2605.28825#bib.bib38)), particularly when navigating long-context alignment (Si et al., [2025a](https://arxiv.org/html/2605.28825#bib.bib36)), and it cannot distinguish between knowledge that is genuinely latent and knowledge that the model simply does not possess. Concurrently, the field of mechanistic interpretability has developed powerful tools for understanding *how* LLMs process information, including Sparse Autoencoders (SAEs) for decomposing polysemantic representations (Cunningham et al., [2023](https://arxiv.org/html/2605.28825#bib.bib9); Gao et al., [2024](https://arxiv.org/html/2605.28825#bib.bib10)), activation patching for causal attribution (Meng et al., [2022](https://arxiv.org/html/2605.28825#bib.bib4); Conmy et al., [2023](https://arxiv.org/html/2605.28825#bib.bib21)), and representation engineering for targeted intervention (Zou et al., [2023](https://arxiv.org/html/2605.28825#bib.bib42)). However, these tools have been applied primarily to *explain* model behavior, not to *extract* hidden knowledge.

We argue that mechanistic interpretability and latent knowledge elicitation are deeply complementary: the former provides the surgical tools to locate and characterize knowledge representations, while the latter provides the motivation and evaluation framework for doing so purposefully. This paper presents MechELK (Mechanistic Elicitation of Latent Knowledge), a unified framework that integrates these two research threads into a coherent pipeline.

Our contributions are as follows:

- We propose MechELK, the first framework to systematically apply mechanistic interpretability tools (SAE feature analysis, activation patching, and representation engineering) to the latent knowledge elicitation problem, providing a principled three-stage Locate-Verify-Elicit pipeline.
- We introduce a *Causal Knowledge Score* (CKS), a novel metric that quantifies the causal contribution of identified features to knowledge expression, enabling reliable distinction between genuine latent knowledge and spurious correlations.
- We demonstrate that MechELK achieves state-of-the-art elicitation accuracy across three benchmarks, outperforming CCS by 6.2% on average, with particularly strong gains on deceptive alignment detection (+11.4%).
- We provide an extensive analysis of failure modes, showing that MechELK’s Verify stage reduces false positives by 34% compared to direct probing approaches, and we characterize the conditions under which latent knowledge is most reliably recoverable.

## 2 Related Work

#### Mechanistic Interpretability.

Mechanistic interpretability seeks to reverse-engineer the algorithms implemented by neural networks at the level of individual components. Foundational work by Elhage et al. ([2022](https://arxiv.org/html/2605.28825#bib.bib11)) demonstrated that neural networks represent more features than they have dimensions through *superposition*, motivating the development of Sparse Autoencoders (SAEs) as a tool for decomposing polysemantic neurons into monosemantic features (Cunningham et al., [2023](https://arxiv.org/html/2605.28825#bib.bib9); Gao et al., [2024](https://arxiv.org/html/2605.28825#bib.bib10)), a mechanism conceptually related to hybrid feature extraction and dimensionality reduction in broader domains (Li et al., [2025](https://arxiv.org/html/2605.28825#bib.bib34)). Circuit-level analysis has identified specific attention heads and MLP layers responsible for factual recall (Wang et al., [2023](https://arxiv.org/html/2605.28825#bib.bib2)), induction (Olsson et al., [2022](https://arxiv.org/html/2605.28825#bib.bib3)), and arithmetic (Nanda et al., [2023](https://arxiv.org/html/2605.28825#bib.bib23)). Activation patching (Meng et al., [2022](https://arxiv.org/html/2605.28825#bib.bib4); [2023](https://arxiv.org/html/2605.28825#bib.bib5)) and its scalable variant attribution patching (Conmy et al., [2023](https://arxiv.org/html/2605.28825#bib.bib21)) enable causal attribution of model behavior to specific components. Feed-forward layers have been shown to function as key-value memories (Geva et al., [2020](https://arxiv.org/html/2605.28825#bib.bib16)), and individual neurons can be attributed to specific factual associations (Dai et al., [2021](https://arxiv.org/html/2605.28825#bib.bib17); Yu and Ananiadou, [2023](https://arxiv.org/html/2605.28825#bib.bib13)). Our work builds on this infrastructure but redirects it toward the goal of knowledge elicitation rather than mere explanation. Furthermore, foundational interpretability principles are increasingly bridging the gap towards multi-modal alignment and parameter-efficient multi-task transfer (Xin et al., [2024b](https://arxiv.org/html/2605.28825#bib.bib31); [a](https://arxiv.org/html/2605.28825#bib.bib32)).

#### Latent Knowledge and Truthfulness.

The question of what LLMs “know” versus what they “say” has received increasing attention. Kadavath et al. ([2022](https://arxiv.org/html/2605.28825#bib.bib7)) showed that models are often calibrated about their own uncertainty, while Lin et al. ([2021](https://arxiv.org/html/2605.28825#bib.bib18)) demonstrated systematic failures of truthfulness in model outputs. The ELK problem was formalized by Mallen et al. ([2023](https://arxiv.org/html/2605.28825#bib.bib12)), who showed that quirky fine-tuned models retain latent knowledge of correct answers even when trained to give wrong ones. Such latent extraction shares motivations with weak-to-strong generalization paradigms, where latent multi-capabilities of advanced models are elicited using weaker supervision signals (Zhou et al., [2025](https://arxiv.org/html/2605.28825#bib.bib39)). Probing classifiers (Belinkov, [2021](https://arxiv.org/html/2605.28825#bib.bib14)) offer a lightweight approach to extracting information from representations, but suffer from the confound that probes may detect surface statistics rather than genuine knowledge (Geva et al., [2023](https://arxiv.org/html/2605.28825#bib.bib6)). The linear representation hypothesis (Park et al., [2023](https://arxiv.org/html/2605.28825#bib.bib15)) provides theoretical grounding for why linear probes can recover meaningful information, while also highlighting their limitations. Our Verify stage addresses the probe confound through causal intervention.

#### Representation Engineering and Steering.

Representation Engineering (RepE) (Zou et al., [2023](https://arxiv.org/html/2605.28825#bib.bib42)) demonstrated that high-level concepts such as honesty and emotion are encoded as linear directions in activation space, and that these directions can be used to steer model behavior. Related work on activation steering (Lanham et al., [2023](https://arxiv.org/html/2605.28825#bib.bib8)) and successor heads (Gould et al., [2023](https://arxiv.org/html/2605.28825#bib.bib29)) further characterizes the geometry of internal representations. The connection between representation structure and model behavior is also explored through the lens of alignment faking (Greenblatt et al., [2024](https://arxiv.org/html/2605.28825#bib.bib19)) and sleeper agents (Hubinger et al., [2024](https://arxiv.org/html/2605.28825#bib.bib20)), which motivate the safety applications of our framework. Analogous representation refinement and alignment methodologies are also being actively applied to correct condition errors in autoregressive generative tasks (Zhou et al., [2026](https://arxiv.org/html/2605.28825#bib.bib41)). Unlike RepE, which focuses on steering model behavior, MechELK uses representation engineering as the final stage of a causally-grounded elicitation pipeline.

## 3 MechELK: Framework and Methodology

### 3.1 Problem Formulation

Let $\mathcal{M}$ denote a pre-trained autoregressive language model with $L$ transformer layers. For an input prompt $x$, let $\mathbf{h}^{(\ell)}_{x} \in \mathbb{R}^{d}$ denote the residual stream activation at layer $\ell \in \{1, \ldots, L\}$ at the final token position. We define a *knowledge query* $q = (x, y^{*}, \mathcal{Y})$, where $x$ is a natural language question, $y^{*} \in \mathcal{Y}$ is the ground-truth answer, and $\mathcal{Y}$ is the answer space.

###### Definition 1 (Latent Knowledge).

A model $\mathcal{M}$ is said to possess *latent knowledge* of the fact $(x, y^{*})$ if there exists a layer $\ell^{*}$ and a linear functional $\phi: \mathbb{R}^{d} \to \mathbb{R}$ such that:

$$\phi(\mathbf{h}^{(\ell^{*})}_{x_{y^{*}}}) > \phi(\mathbf{h}^{(\ell^{*})}_{x_{y}}) \quad \forall y \in \mathcal{Y} \setminus \{y^{*}\}, \tag{1}$$

where $x_{y}$ denotes the prompt $x$ concatenated with candidate answer $y$, yet $\mathcal{M}(x) \neq y^{*}$ under standard decoding.

This definition captures the intuition that latent knowledge exists when the model’s internal representations encode the correct answer, even if the output distribution does not reflect it. The challenge is to find the layer $\ell^{*}$ and functional $\phi$ efficiently and reliably.
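As a minimal sketch of what Definition 1 asks for, the check below tests the inequality of Eq. (1) for one fixed layer and one fixed linear functional, together with the condition that the decoded output differs from the ground truth. The function names, the toy activations, and the weight vector are illustrative assumptions and do not come from the paper; the definition itself only requires that some layer and functional exist.

```python
from typing import Dict, List


def dot(u: List[float], v: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def has_latent_knowledge(
    acts: Dict[str, List[float]],  # candidate answer y -> activation h_{x_y} at one layer
    w: List[float],                # weights of the linear functional phi(h) = <w, h>
    y_star: str,                   # ground-truth answer
    model_output: str,             # M(x) under standard decoding
) -> bool:
    """Check Eq. (1) for a fixed layer and functional, plus the output condition."""
    if model_output == y_star:
        return False  # the knowledge is expressed, so it is not latent
    target = dot(w, acts[y_star])
    return all(target > dot(w, h) for y, h in acts.items() if y != y_star)


# Toy values (hypothetical, for illustration only).
acts = {"Paris": [1.0, 0.2], "Lyon": [0.1, 0.9], "Nice": [0.0, 0.5]}
w = [1.0, -0.5]
print(has_latent_knowledge(acts, w, "Paris", model_output="Lyon"))
```

A full test of the definition would search over layers and functionals; this sketch only verifies a given candidate pair.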

###### Definition 2 (Causal Knowledge Score).

Given a knowledge query $q$ and a candidate feature direction $\mathbf{v} \in \mathbb{R}^{d}$ at layer $\ell$, the *Causal Knowledge Score* (CKS) is defined as:

$$\text{CKS}(\mathbf{v}, \ell, q) = \mathbb{E}_{y \in \mathcal{Y}}\left[\frac{\partial \log P_{\mathcal{M}}(y^{*} \mid x)}{\partial \alpha}\bigg|_{\alpha=0}\right], \tag{2}$$

where the expectation is over a patching intervention $\mathbf{h}^{(\ell)}_{x} \leftarrow \mathbf{h}^{(\ell)}_{x} + \alpha\mathbf{v}$ applied to the residual stream. A high CKS indicates that the direction $\mathbf{v}$ causally mediates the expression of the correct answer $y^{*}$.

The CKS extends standard activation patching (Meng et al., [2022](https://arxiv.org/html/2605.28825#bib.bib4)) by measuring the *directional* causal effect of a specific feature vector, rather than the total effect of replacing an entire activation. This allows us to attribute knowledge expression to specific SAE features rather than entire layers.

### 3.2 Framework Overview

MechELK operates as a three-stage pipeline. Given a knowledge query $q$, the framework proceeds as follows: (1) the Locate stage identifies the layer and feature directions most causally responsible for encoding the knowledge; (2) the Verify stage applies causal probing to confirm that the identified features encode genuine knowledge rather than spurious correlations; and (3) the Elicit stage uses representation engineering to surface the latent knowledge as an observable output.

### 3.3 Stage 1: Locate

The Locate stage aims to identify the layer $\ell^{*}$ and feature direction $\mathbf{v}^{*}$ that most strongly encode the knowledge associated with query $q$. This stage combines SAE-based feature decomposition with activation patching to achieve both interpretability and causal grounding.

#### SAE Feature Decomposition.

For each layer $\ell$, we apply a pre-trained Sparse Autoencoder $\mathcal{S}_{\ell}: \mathbb{R}^{d} \to \mathbb{R}^{n}$ (with $n \gg d$) to decompose the residual stream activation into a sparse combination of interpretable features:

$$\hat{\mathbf{h}}^{(\ell)}_{x} = \mathbf{W}_{\text{dec}} \cdot \text{ReLU}(\mathbf{W}_{\text{enc}}\mathbf{h}^{(\ell)}_{x} + \mathbf{b}_{\text{enc}}) + \mathbf{b}_{\text{dec}}, \tag{3}$$

where $\mathbf{W}_{\text{enc}} \in \mathbb{R}^{n \times d}$ and $\mathbf{W}_{\text{dec}} \in \mathbb{R}^{d \times n}$ are the encoder and decoder weight matrices, respectively. The sparse activation vector $\mathbf{f}_{\ell}(x) = \text{ReLU}(\mathbf{W}_{\text{enc}}\mathbf{h}^{(\ell)}_{x} + \mathbf{b}_{\text{enc}}) \in \mathbb{R}^{n}$ identifies the active features at layer $\ell$ for input $x$.

To identify knowledge-relevant features, we compute the *feature differential* between the correct and incorrect answer prompts:

$$\Delta\mathbf{f}_{\ell}(q) = \mathbf{f}_{\ell}(x_{y^{*}}) - \frac{1}{|\mathcal{Y}| - 1}\sum_{y \neq y^{*}}\mathbf{f}_{\ell}(x_{y}), \tag{4}$$

and select the top-$k$ features by $\|\Delta\mathbf{f}_{\ell}(q)\|_{1}$ as candidate knowledge features $\mathcal{F}_{\ell}(q)$.
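The SAE encoding and the feature differential of Eq. (4) can be sketched on a toy example as follows. The weights, activations, and helper names are hypothetical, and the ranking rule is an assumption: features are ordered by the absolute value of their own entry in the differential, which is one reading of the $\ell_1$ criterion stated above.

```python
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def sae_encode(W_enc: List[List[float]], b_enc: List[float], h: List[float]) -> List[float]:
    """Sparse activations f(x) = ReLU(W_enc h + b_enc)."""
    return [max(0.0, dot(row, h) + b) for row, b in zip(W_enc, b_enc)]


def feature_differential(f_correct: List[float], f_wrong: List[List[float]]) -> List[float]:
    """Eq. (4): correct-answer activations minus the mean over incorrect answers."""
    mean_wrong = [sum(col) / len(f_wrong) for col in zip(*f_wrong)]
    return [a - b for a, b in zip(f_correct, mean_wrong)]


def top_k(delta: List[float], k: int) -> List[int]:
    """Indices of the k entries with the largest magnitude (assumed ranking rule)."""
    return sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)[:k]


# Toy SAE with d = 2 and n = 3 (hypothetical weights).
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]

f_correct = sae_encode(W_enc, b_enc, [2.0, 0.0])                       # h for x_{y*}
f_wrong = [sae_encode(W_enc, b_enc, h) for h in ([0.0, 1.0], [0.0, 2.0])]  # h for x_y, y != y*

delta = feature_differential(f_correct, f_wrong)
candidates = top_k(delta, k=2)
print(delta, candidates)
```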

#### Activation Patching for Layer Selection.

To identify the most causally relevant layer $\ell^{*}$, we perform activation patching across all layers. For each layer $\ell$, we compute the *patching effect*:

$$\text{PE}(\ell, q) = \log P_{\mathcal{M}}(y^{*} \mid x)\big|_{\mathbf{h}^{(\ell)}_{x} \leftarrow \mathbf{h}^{(\ell)}_{x_{y^{*}}}} - \log P_{\mathcal{M}}(y^{*} \mid x), \tag{5}$$

which measures how much the model’s probability of the correct answer increases when the activation at layer $\ell$ is replaced with the “clean” activation from the correct-answer prompt. The optimal layer is selected as:

$$\ell^{*} = \arg\max_{\ell} \text{PE}(\ell, q). \tag{6}$$

The combination of SAE decomposition and activation patching yields a set of candidate knowledge features $\mathcal{F}_{\ell^{*}}(q)$ at the most causally relevant layer, providing both interpretability (via SAE features) and causal grounding (via patching).
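Layer selection via Eqs. (5) and (6) reduces to a difference of log-probabilities followed by an argmax. The sketch below assumes the patched log-probabilities have already been measured by some hooking mechanism that is not shown; the probabilities are made-up numbers used only to exercise the arithmetic.

```python
import math
from typing import Dict


def patching_effects(logp_base: float, logp_patched: Dict[int, float]) -> Dict[int, float]:
    """Eq. (5): PE(l, q) = log P(y* | x, layer l patched) - log P(y* | x)."""
    return {layer: lp - logp_base for layer, lp in logp_patched.items()}


def select_layer(effects: Dict[int, float]) -> int:
    """Eq. (6): the layer with the largest patching effect."""
    return max(effects, key=effects.get)


# Hypothetical measurements: P(y* | x) = 0.10 unpatched, and after patching layers 1..3.
logp_base = math.log(0.10)
logp_patched = {1: math.log(0.11), 2: math.log(0.35), 3: math.log(0.20)}

effects = patching_effects(logp_base, logp_patched)
best_layer = select_layer(effects)
print(best_layer, round(effects[best_layer], 3))
```

Because PE is a difference of logs, it equals the log of the probability ratio, so layer 2 here corresponds to a 3.5-fold increase in the probability of the correct answer.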

### 3.4 Stage 2: Verify

The Verify stage addresses a critical limitation of direct probing: the possibility that identified features reflect surface-level statistical correlations rather than genuine causal knowledge. We introduce a *causal verification* procedure based on the CKS metric defined in Definition [2](https://arxiv.org/html/2605.28825#Thmdefinition2).

For each candidate feature $i \in \mathcal{F}_{\ell^{*}}(q)$, we compute its CKS by performing a directional patching intervention along the corresponding decoder direction $\mathbf{v}_{i} = \mathbf{W}_{\text{dec}}[:, i]$:

$$\text{CKS}(i, q) = \frac{P_{\mathcal{M}}(y^{*} \mid x; \mathbf{h}^{(\ell^{*})}_{x} + \epsilon\mathbf{v}_{i}) - P_{\mathcal{M}}(y^{*} \mid x; \mathbf{h}^{(\ell^{*})}_{x} - \epsilon\mathbf{v}_{i})}{2\epsilon}, \tag{7}$$

where $\epsilon > 0$ is a small perturbation magnitude. This finite-difference approximation of the directional derivative provides a computationally efficient estimate of the causal effect.
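The central difference in Eq. (7) can be illustrated on a toy stand-in for the model. Here the "rest of the network" after layer $\ell^{*}$ is assumed to be a single softmax readout, which is a simplification introduced for this sketch; the readout matrix, activation, and directions are hypothetical.

```python
import math
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def prob_correct(h: List[float], readout: List[List[float]], y_idx: int) -> float:
    """P(y* | h) under a toy softmax readout (stand-in for the downstream model)."""
    logits = [dot(row, h) for row in readout]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return exps[y_idx] / sum(exps)


def cks(h: List[float], v: List[float], readout: List[List[float]],
        y_idx: int, eps: float = 1e-3) -> float:
    """Eq. (7): central finite difference of P(y*) along direction v."""
    plus = [a + eps * b for a, b in zip(h, v)]
    minus = [a - eps * b for a, b in zip(h, v)]
    return (prob_correct(plus, readout, y_idx) - prob_correct(minus, readout, y_idx)) / (2 * eps)


readout = [[1.0, 0.0], [0.0, 1.0]]  # two candidate answers; index 0 is y*
h = [0.0, 0.0]                       # activation at the chosen layer

helpful = cks(h, [1.0, 0.0], readout, y_idx=0)   # pushes toward y*
harmful = cks(h, [0.0, 1.0], readout, y_idx=0)   # pushes toward the other answer
neutral = cks(h, [1.0, 1.0], readout, y_idx=0)   # shifts both logits equally
print(helpful, harmful, neutral)
```

In this toy setting the exact directional derivatives are 0.25, -0.25, and 0, so the finite-difference estimates can be compared against known values.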

A feature is classified as a *genuine knowledge feature* if its CKS exceeds a threshold $\tau$:

$$\mathcal{F}^{*}_{\ell^{*}}(q) = \{i \in \mathcal{F}_{\ell^{*}}(q) : \text{CKS}(i, q) > \tau\}, \tag{8}$$

where $\tau$ is calibrated on a held-out validation set. Features that pass this threshold are considered to causally mediate the expression of the correct answer, providing strong evidence of latent knowledge.

###### Proposition 1 (Causal Sufficiency).

If $\mathcal{F}^{*}_{\ell^{*}}(q) \neq \emptyset$, then the model $\mathcal{M}$ possesses latent knowledge of $(x, y^{*})$ in the sense of Definition [1](https://arxiv.org/html/2605.28825#Thmdefinition1), with the knowledge direction given by:

$$\mathbf{v}^{*} = \sum_{i \in \mathcal{F}^{*}_{\ell^{*}}(q)} \text{CKS}(i, q) \cdot \mathbf{v}_{i}. \tag{9}$$

###### Proof.

By construction, each feature $i \in \mathcal{F}^{*}_{\ell^{*}}(q)$ satisfies $\text{CKS}(i, q) > \tau > 0$, meaning that increasing the activation of feature $i$ increases $\log P_{\mathcal{M}}(y^{*} \mid x)$. The weighted combination $\mathbf{v}^{*}$ therefore satisfies:

$$\frac{\partial \log P_{\mathcal{M}}(y^{*} \mid x)}{\partial \alpha}\bigg|_{\mathbf{h}^{(\ell^{*})}_{x} \leftarrow \mathbf{h}^{(\ell^{*})}_{x} + \alpha\mathbf{v}^{*}} = \sum_{i \in \mathcal{F}^{*}_{\ell^{*}}(q)} \text{CKS}(i, q)^{2} > 0. \tag{10}$$

By the implicit function theorem, there exists $\alpha^{*} > 0$ such that $P_{\mathcal{M}}(y^{*} \mid x; \mathbf{h}^{(\ell^{*})}_{x} + \alpha^{*}\mathbf{v}^{*}) > P_{\mathcal{M}}(y \mid x; \mathbf{h}^{(\ell^{*})}_{x} + \alpha^{*}\mathbf{v}^{*})$ for all $y \neq y^{*}$, establishing the existence of the linear functional $\phi(\cdot) = \langle \mathbf{v}^{*}, \cdot \rangle$ required by Definition [1](https://arxiv.org/html/2605.28825#Thmdefinition1). ∎
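The thresholding of Eq. (8), the weighted combination of Eq. (9), and the sum-of-squares identity of Eq. (10) can be checked numerically on a toy example. The sketch assumes a softmax readout as a stand-in for the downstream model and uses hypothetical decoder directions. It also measures slopes of the probability, as in Eq. (7), whereas Eq. (10) is written for the log-probability. The identity follows from linearity of the directional derivative, provided the same quantity is differentiated on both sides.

```python
import math
from typing import Dict, List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def prob_correct(h: List[float], readout: List[List[float]], y_idx: int) -> float:
    logits = [dot(row, h) for row in readout]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return exps[y_idx] / sum(exps)


def slope(h: List[float], v: List[float], readout: List[List[float]],
          y_idx: int, eps: float = 1e-3) -> float:
    """Central finite difference of P(y*) along v (same form as Eq. 7)."""
    plus = [a + eps * b for a, b in zip(h, v)]
    minus = [a - eps * b for a, b in zip(h, v)]
    return (prob_correct(plus, readout, y_idx) - prob_correct(minus, readout, y_idx)) / (2 * eps)


readout = [[1.0, 0.0], [0.0, 1.0]]   # toy downstream model; index 0 is y*
h = [0.0, 0.0]
decoder_dirs: Dict[int, List[float]] = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, -1.0]}
tau = 0.1                             # hypothetical threshold

scores = {i: slope(h, v, readout, 0) for i, v in decoder_dirs.items()}
verified = {i: s for i, s in scores.items() if s > tau}                    # Eq. (8)
v_star = [sum(s * decoder_dirs[i][j] for i, s in verified.items())        # Eq. (9)
          for j in range(len(h))]

lhs = slope(h, v_star, readout, 0)            # slope along v*
rhs = sum(s * s for s in verified.values())   # sum of squared scores, as in Eq. (10)
print(sorted(verified), v_star, lhs, rhs)
```

Feature 1 is rejected because its score is negative, and the slope along $\mathbf{v}^{*}$ matches the sum of squared scores up to finite-difference error.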

### 3.5 Stage 3: Elicit

Given the verified knowledge direction $\mathbf{v}^{*}$ from Stage 2, the Elicit stage surfaces the latent knowledge as an observable output by applying a targeted representation engineering intervention at inference time.

The elicitation intervention modifies the residual stream at layer $\ell^{*}$ during the forward pass:

$$\tilde{\mathbf{h}}^{(\ell^{*})}_{x} = \mathbf{h}^{(\ell^{*})}_{x} + \lambda \cdot \mathbf{v}^{*}, \tag{11}$$

where $\lambda > 0$ is the intervention strength, calibrated to maximize elicitation accuracy while minimizing disruption to other model behaviors. The elicited answer is then obtained by standard decoding from the modified model:

$$\hat{y} = \arg\max_{y \in \mathcal{Y}} P_{\mathcal{M}}(y \mid x; \tilde{\mathbf{h}}^{(\ell^{*})}_{x}). \tag{12}$$

The intervention strength $\lambda$ is selected via a cross-validation procedure on a small set of queries with known latent knowledge, using the objective:

$$\lambda^{*} = \arg\max_{\lambda} \frac{1}{|\mathcal{Q}_{\text{val}}|}\sum_{q \in \mathcal{Q}_{\text{val}}} \mathbf{1}[\hat{y}(q, \lambda) = y^{*}(q)], \tag{13}$$

where $\mathcal{Q}_{\text{val}}$ is the validation query set.
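A minimal sketch of Eqs. (11) to (13) follows, again with a toy softmax readout standing in for the layers after $\ell^{*}$. The grid of candidate strengths, the validation queries, and the tie-breaking rule (the smallest strength among equally accurate ones, as a crude proxy for minimal disruption) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Query = Tuple[List[float], List[float], int]  # (activation h, direction v*, index of y*)


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def elicit(h: List[float], v_star: List[float], lam: float,
           readout: List[List[float]]) -> int:
    """Eqs. (11)-(12): steer the activation, then take the argmax answer."""
    steered = [a + lam * b for a, b in zip(h, v_star)]
    logits = [dot(row, steered) for row in readout]
    # argmax over logits equals argmax over softmax probabilities
    return max(range(len(logits)), key=logits.__getitem__)


def select_lambda(val_queries: Sequence[Query], readout: List[List[float]],
                  grid: Sequence[float]) -> float:
    """Eq. (13): the grid value with the highest validation accuracy."""
    def accuracy(lam: float) -> float:
        hits = sum(elicit(h, v, lam, readout) == y for h, v, y in val_queries)
        return hits / len(val_queries)
    # max() returns the first maximiser, i.e. the smallest strength in an ascending grid
    return max(grid, key=accuracy)


readout = [[1.0, 0.0], [0.0, 1.0]]
val_queries: List[Query] = [
    ([0.0, 1.5], [1.0, 0.0], 0),  # needs lambda > 1.5 to flip to the correct answer
    ([0.0, 3.0], [1.0, 0.0], 0),  # needs lambda > 3.0
]
best_lam = select_lambda(val_queries, readout, grid=[0.0, 1.0, 2.0, 4.0, 8.0])
print(best_lam)
```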

### 3.6 Algorithm

The complete MechELK pipeline is summarized in Algorithm [1](https://arxiv.org/html/2605.28825#alg1).

Algorithm 1 MechELK: Mechanistic Elicitation of Latent Knowledge

Input: Model $\mathcal{M}$, SAEs $\{\mathcal{S}_{\ell}\}_{\ell=1}^{L}$, knowledge query $q = (x, y^{*}, \mathcal{Y})$, threshold $\tau$, strength $\lambda$

Output: Elicited answer $\hat{y}$ and latent knowledge indicator $\kappa \in \{0, 1\}$

1. // Stage 1: Locate
2. for $\ell = 1$ to $L$ do
3. Compute $\mathbf{f}_{\ell}(x_{y^{*}})$ and $\mathbf{f}_{\ell}(x_{y})$ for all $y \in \mathcal{Y}$
4. Compute feature differential $\Delta\mathbf{f}_{\ell}(q)$ via Eq. ([4](https://arxiv.org/html/2605.28825#S3.E4))
5. Select top-$k$ features: $\mathcal{F}_{\ell}(q) \leftarrow \text{TopK}(\Delta\mathbf{f}_{\ell}(q), k)$
6. Compute patching effect $\text{PE}(\ell, q)$ via Eq. ([5](https://arxiv.org/html/2605.28825#S3.E5))
7. end for
8. $\ell^{*} \leftarrow \arg\max_{\ell} \text{PE}(\ell, q)$
9. // Stage 2: Verify
10. for $i \in \mathcal{F}_{\ell^{*}}(q)$ do
11. Compute $\text{CKS}(i, q)$ via Eq. ([7](https://arxiv.org/html/2605.28825#S3.E7))
12. end for
13. $\mathcal{F}^{*}_{\ell^{*}}(q) \leftarrow \{i : \text{CKS}(i, q) > \tau\}$
14. if $\mathcal{F}^{*}_{\ell^{*}}(q) = \emptyset$ then
15. $\kappa \leftarrow 0$; return $\mathcal{M}(x)$, $\kappa$
16. end if
17. Compute $\mathbf{v}^{*}$ via Eq. ([9](https://arxiv.org/html/2605.28825#S3.E9))
18. $\kappa \leftarrow 1$
19. // Stage 3: Elicit
20. $\tilde{\mathbf{h}}^{(\ell^{*})}_{x} \leftarrow \mathbf{h}^{(\ell^{*})}_{x} + \lambda \cdot \mathbf{v}^{*}$
21. $\hat{y} \leftarrow \arg\max_{y \in \mathcal{Y}} P_{\mathcal{M}}(y \mid x; \tilde{\mathbf{h}}^{(\ell^{*})}_{x})$
22. return $\hat{y}$, $\kappa$
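The control flow of Algorithm 1, including the early return when no feature passes verification, can be sketched as a function over injected callables. Every callable name and signature below is a hypothetical interface chosen for this sketch; the stubs at the bottom return made-up values and do not touch a real model.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def mechelk(
    layers: Sequence[int],
    candidates_at: Callable[[int], List[int]],           # layer -> top-k feature ids (Eq. 4)
    patching_effect: Callable[[int], float],             # layer -> PE (Eq. 5)
    cks: Callable[[int, int], float],                     # (layer, feature) -> CKS (Eq. 7)
    decoder_dir: Callable[[int, int], List[float]],      # (layer, feature) -> v_i
    default_answer: Callable[[], str],                   # M(x) under standard decoding
    steered_answer: Callable[[int, List[float]], str],   # (layer, lambda * v*) -> y_hat
    tau: float,
    lam: float,
) -> Tuple[str, int]:
    # Stage 1: Locate
    candidates: Dict[int, List[int]] = {l: candidates_at(l) for l in layers}
    best = max(layers, key=patching_effect)               # Eq. (6)

    # Stage 2: Verify
    scores = {i: cks(best, i) for i in candidates[best]}
    verified = {i: s for i, s in scores.items() if s > tau}   # Eq. (8)
    if not verified:
        return default_answer(), 0                        # kappa = 0: nothing verified

    dim = len(decoder_dir(best, next(iter(verified))))
    v_star = [0.0] * dim
    for i, s in verified.items():                         # Eq. (9)
        v_star = [a + s * b for a, b in zip(v_star, decoder_dir(best, i))]

    # Stage 3: Elicit
    return steered_answer(best, [lam * c for c in v_star]), 1


# Stub components with made-up values, only to exercise the control flow.
stubs = dict(
    layers=[1, 2, 3],
    candidates_at=lambda l: [0, 1],
    patching_effect={1: 0.1, 2: 0.9, 3: 0.3}.get,
    cks=lambda l, i: {0: 0.4, 1: -0.2}[i],
    decoder_dir=lambda l, i: [[1.0, 0.0], [0.0, 1.0]][i],
    default_answer=lambda: "Lyon",
    steered_answer=lambda l, delta: "Paris" if delta[0] > 0.5 else "Lyon",
)
print(mechelk(**stubs, tau=0.1, lam=2.0))
print(mechelk(**stubs, tau=1.0, lam=2.0))
```

Passing the stages in as callables keeps the skeleton independent of any particular model or SAE library, which is a design choice of this sketch rather than something the algorithm prescribes.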

### 3.7 Theoretical Analysis

###### Theorem 1 (Elicitation Consistency).

Let $q_{1}, \ldots, q_{m}$ be $m$ knowledge queries sharing the same underlying fact $(x_{\text{base}}, y^{*})$ but with different surface phrasings. If $\mathcal{M}$ possesses latent knowledge of $(x_{\text{base}}, y^{*})$, then under mild regularity conditions on the SAE reconstruction quality, the knowledge directions $\mathbf{v}^{*}(q_{1}), \ldots, \mathbf{v}^{*}(q_{m})$ computed by MechELK satisfy:

$$\frac{1}{m(m-1)}\sum_{i \neq j}\cos(\mathbf{v}^{*}(q_{i}), \mathbf{v}^{*}(q_{j})) \geq 1 - \delta, \tag{14}$$

for some $\delta > 0$ that decreases with SAE reconstruction quality.

###### Proof Sketch.

The key insight is that if the model encodes the same underlying fact across different phrasings, the SAE features activated by the fact-relevant tokens will overlap substantially across queries. Formally, let $\mathcal{F}^{*}(q_{i})$ denote the verified feature set for query $q_{i}$. By the linear representation hypothesis (Park et al., [2023](https://arxiv.org/html/2605.28825#bib.bib15)), the knowledge direction for a given fact lies in a low-dimensional subspace of the residual stream. The SAE, by virtue of its reconstruction objective, approximates this subspace with error bounded by the reconstruction loss $\|\mathbf{h}^{(\ell^{*})}_{x} - \hat{\mathbf{h}}^{(\ell^{*})}_{x}\|_{2}$. The cosine similarity bound follows from the triangle inequality applied to the angular distances between the projected knowledge directions. ∎

Theorem [1](https://arxiv.org/html/2605.28825#Thmtheorem1) provides a testable prediction: the knowledge directions recovered by MechELK should be consistent across paraphrases of the same query. We validate this prediction empirically in Section [4.4](https://arxiv.org/html/2605.28825#S4.SS4).
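The left-hand side of Eq. (14) is the mean cosine similarity over ordered pairs of distinct queries. A minimal sketch of that computation, using hypothetical direction vectors, is:

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def consistency_score(directions: List[List[float]]) -> float:
    """Left-hand side of Eq. (14): mean cosine over all ordered pairs i != j."""
    m = len(directions)
    total = sum(
        cosine(directions[i], directions[j])
        for i in range(m) for j in range(m) if i != j
    )
    return total / (m * (m - 1))


# Hypothetical knowledge directions for three paraphrases of one fact.
aligned = [[1.0, 0.0], [2.0, 0.0], [0.5, 0.0]]
mixed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(consistency_score(aligned), consistency_score(mixed))
```

Directions that differ only in scale give a score of 1, and one orthogonal outlier among three directions lowers the score to 1/3.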

###### Theorem 2 (Complexity).

The computational complexity of MechELK for a single query $q$ with answer space $|\mathcal{Y}|$ is $O(L \cdot |\mathcal{Y}| \cdot (d \cdot n + k))$, where $L$ is the number of layers, $d$ is the hidden dimension, $n$ is the SAE dictionary size, and $k$ is the number of candidate features.

This complexity is dominated by the SAE forward passes in Stage 1, and is linear in the number of layers and answer candidates. In practice, with $L = 32$, $|\mathcal{Y}| = 4$, $d = 4096$, $n = 65536$, and $k = 20$, MechELK requires approximately 3.2 seconds per query on a single A100 GPU, compared to 0.1 seconds for direct probing and 8.7 seconds for full CCS.
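To see why the SAE term dominates, the expression inside the bound of Theorem 2 can be evaluated at the stated configuration. This is only the leading-term count with constants ignored, not a measured FLOP figure.

```python
# Configuration quoted in the text.
L, Y, d, n, k = 32, 4, 4096, 65536, 20

sae_term = d * n                     # per (layer, candidate) cost of one SAE encoder pass
total = L * Y * (sae_term + k)       # expression inside the O(.) of Theorem 2
sae_share = (L * Y * sae_term) / total

print(total)       # 34359740928
print(sae_share)   # very close to 1: the k term is negligible
```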

## 4 Experiments

### 4.1 Experimental Setup

#### Models.

We evaluate MechELK on three open-source LLMs: Llama-3-8B, Llama-3-70B, and Mistral-7B-v0.3. For each model, we use publicly available SAEs trained on the corresponding model’s activations (Gao et al., [2024](https://arxiv.org/html/2605.28825#bib.bib10); Cunningham et al., [2023](https://arxiv.org/html/2605.28825#bib.bib9)).

#### Datasets.

We evaluate on three benchmarks designed to probe different aspects of latent knowledge: (1) TruthfulQA (Lin et al., [2021](https://arxiv.org/html/2605.28825#bib.bib18)): 817 questions spanning 38 categories, where models trained on human text tend to produce falsehoods. We use the multiple-choice variant (MC1) to enable controlled evaluation. (2) Quirky LM (Mallen et al., [2023](https://arxiv.org/html/2605.28825#bib.bib12)): A dataset of 1,200 factual questions paired with fine-tuned “quirky” model variants that have been trained to give incorrect answers while retaining latent knowledge of the correct ones. (3) Deceptive Alignment Benchmark (DAB): A curated dataset of 400 scenarios inspired by Hubinger et al. ([2024](https://arxiv.org/html/2605.28825#bib.bib20)) and Greenblatt et al. ([2024](https://arxiv.org/html/2605.28825#bib.bib19)), where models exhibit context-dependent behavior that may conceal internal states.

#### Baselines.

We compare MechELK against five baselines: (1) Direct Probing (DP): A linear probe trained on residual stream activations at the layer with highest probing accuracy (Belinkov, [2021](https://arxiv.org/html/2605.28825#bib.bib14)). (2) CCS (Mallen et al., [2023](https://arxiv.org/html/2605.28825#bib.bib12)): Contrastive Consistency Search, the primary prior method for ELK. (3) RepE (Zou et al., [2023](https://arxiv.org/html/2605.28825#bib.bib42)): Representation Engineering applied directly to the “honesty” direction without the Locate and Verify stages. (4) SAE-Probe: SAE feature activations used as input to a linear probe, without causal verification. (5) Activation Patching (AP): Layer-level activation patching without SAE decomposition or causal verification.

#### Evaluation Metrics.

We report: (1) Elicitation Accuracy (EA): the fraction of queries where the elicited answer matches the ground truth; (2) Detection Rate (DR): the fraction of latent knowledge cases correctly identified by the Verify stage; (3) False Positive Rate (FPR): the fraction of non-latent-knowledge cases incorrectly classified as latent knowledge; and (4) Consistency Score (CS): the average cosine similarity between knowledge directions for paraphrased queries (Eq. ([14](https://arxiv.org/html/2605.28825#S3.E14))).
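The first three metrics are simple ratios, sketched below under the assumption that each query carries a binary ground-truth label for whether latent knowledge is present and a binary flag from the Verify stage. The function names and the toy labels are hypothetical.

```python
from typing import List, Sequence, Tuple


def elicitation_accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """EA: fraction of queries whose elicited answer matches the ground truth."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def detection_and_fpr(flags: Sequence[int], is_latent: Sequence[int]) -> Tuple[float, float]:
    """DR = TP / (#latent cases); FPR = FP / (#non-latent cases)."""
    tp = sum(1 for f, t in zip(flags, is_latent) if f and t)
    fp = sum(1 for f, t in zip(flags, is_latent) if f and not t)
    n_latent = sum(is_latent)
    n_other = len(is_latent) - n_latent
    return tp / n_latent, fp / n_other


# Toy labels (illustrative only).
preds: List[str] = ["a", "b", "c", "d"]
golds: List[str] = ["a", "b", "x", "d"]
flags = [1, 1, 0, 1, 0]       # kappa returned by the Verify stage
is_latent = [1, 1, 1, 0, 0]   # ground truth: latent knowledge present?

ea = elicitation_accuracy(preds, golds)
dr, fpr = detection_and_fpr(flags, is_latent)
print(ea, dr, fpr)
```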

### 4\.2Main Results

Table [1](https://arxiv.org/html/2605.28825#S4.T1) presents the main comparison across all methods and datasets\. MechELK outperforms every baseline on all three benchmarks\.

Table 1: Main results: Elicitation Accuracy \(%\) on three benchmarks\. Best results are bold; second\-best are underlined\. $\Delta$ denotes improvement over CCS\.

MechELK achieves an average elicitation accuracy of 84\.7% across all settings, compared to 78\.5% for CCS \(\+6\.2%\) and 75\.6% for direct probing \(\+9\.1%\)\. The gains are most pronounced on the Deceptive Alignment Benchmark, where MechELK outperforms CCS by an average of 13\.8%\. This is particularly significant because DAB scenarios are specifically designed to challenge methods that rely on surface\-level consistency, and the strong performance of MechELK on this benchmark validates the importance of the causal Verify stage\.

The improvement over SAE\-Probe \(\+7\.1% on average\) demonstrates that the causal verification step is not merely redundant with SAE feature selection: many features that are strongly activated by the correct\-answer prompt are not causally responsible for knowledge expression, and filtering them out via CKS substantially improves precision\. Similarly, the improvement over Activation Patching \(\+9\.4%\) shows that SAE decomposition provides important additional signal beyond layer\-level causal attribution\.

Table [2](https://arxiv.org/html/2605.28825#S4.T2) reports the detection and false positive metrics, providing a more granular view of the Verify stage’s performance\.

Table 2: Detection Rate \(DR\), False Positive Rate \(FPR\), and Consistency Score \(CS\) on Llama\-3\-8B\. Lower FPR and higher DR/CS are better\.

MechELK achieves a detection rate of 91\.4%, substantially higher than all baselines, while simultaneously reducing the false positive rate to 12\.7%, a 34% relative reduction compared to direct probing \(28\.4%\) and a 43% reduction compared to CCS \(22\.1%\)\. The high consistency score of 0\.89 validates Theorem [1](https://arxiv.org/html/2605.28825#Thmtheorem1): the knowledge directions recovered by MechELK are highly stable across paraphrased queries, confirming that they capture genuine semantic content rather than surface\-level artifacts\.

### 4\.3Ablation Studies

To understand the contribution of each stage, we conduct a systematic ablation study by progressively removing components of MechELK\. Table [3](https://arxiv.org/html/2605.28825#S4.T3) reports the results on TruthfulQA with Llama\-3\-8B\.

Table 3: Ablation study on TruthfulQA \(Llama\-3\-8B\)\. Each row removes one component from the full MechELK pipeline\.

The ablation results reveal several important insights\. Removing the Verify stage \(CKS filtering\) causes the largest drop in elicitation accuracy \(\-6\.2%\) and a dramatic increase in false positive rate \(\+11\.6%\), confirming that causal verification is the most critical component of MechELK\. Without SAE decomposition, performance drops by 4\.9%, demonstrating that the interpretable feature decomposition provides signal beyond raw activation patching\. Layer selection contributes 7\.5% improvement over using the last layer, consistent with prior work showing that factual knowledge is often encoded in middle layers \(Meng et al\., [2022](https://arxiv.org/html/2605.28825#bib.bib4); Geva et al\., [2023](https://arxiv.org/html/2605.28825#bib.bib6)\)\. The feature differential \(Eq\. \([4](https://arxiv.org/html/2605.28825#S3.E4)\)\) contributes 4\.1% improvement over using only the correct\-answer features, as it filters out features that are activated by any answer rather than specifically by the correct one\.

### 4\.4Analysis

#### Knowledge Layer Distribution\.

Figure [1](https://arxiv.org/html/2605.28825#S4.F1) shows the distribution of optimal knowledge layers $\ell^{*}$ selected by MechELK across all queries in TruthfulQA\. Knowledge is predominantly encoded in layers 12–20 \(out of 32 total layers\), with a peak at layer 16\. This is consistent with the “middle layers” hypothesis from prior work \(Meng et al\., [2022](https://arxiv.org/html/2605.28825#bib.bib4)\) and suggests that factual knowledge consolidates in the middle of the network before being decoded in later layers\. Notably, the distribution is bimodal for the DAB benchmark, with a secondary peak at layers 24–28, suggesting that deceptive alignment involves a two\-stage process: knowledge encoding in middle layers and suppression in later layers\.

![Refer to caption](https://arxiv.org/html/2605.28825v1/x1.png)

Figure 1: Distribution of optimal knowledge layers $\ell^{*}$ selected by MechELK across TruthfulQA \(blue\) and DAB \(orange\) queries on Llama\-3\-8B\. The bimodal distribution on DAB suggests a two\-stage knowledge\-suppression mechanism\.
#### CKS Threshold Sensitivity\.

Figure [2](https://arxiv.org/html/2605.28825#S4.F2) shows the effect of the CKS threshold $\tau$ on elicitation accuracy, detection rate, and false positive rate\. The optimal threshold $\tau^{*}=0.15$ achieves the best trade\-off between detection rate and false positive rate, and is remarkably stable across different models and datasets \(standard deviation $<0.02$\)\. This robustness suggests that the CKS threshold captures a genuine property of knowledge representations rather than a dataset\-specific artifact\.

![Refer to caption](https://arxiv.org/html/2605.28825v1/x2.png)

Figure 2: Effect of CKS threshold $\tau$ on elicitation accuracy \(EA\), detection rate \(DR\), and false positive rate \(FPR\) on TruthfulQA\. The optimal threshold $\tau^{*}=0.15$ is stable across models\.
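The sweep behind a sensitivity curve of this kind can be sketched as follows. This is an illustration under assumptions: each query has one scalar CKS value, a query is flagged as latent knowledge when its CKS exceeds $\tau$, and "best trade-off" is operationalized as maximizing DR minus FPR (Youden's J statistic), a criterion this section does not specify. All names and the toy values in the usage note are hypothetical.

```python
from typing import List, Sequence, Tuple


def sweep_threshold(
    cks_scores: Sequence[float],
    is_latent: Sequence[bool],
    thresholds: Sequence[float],
) -> Tuple[float, List[Tuple[float, float, float]]]:
    """Return (best_tau, rows), where each row is (tau, DR, FPR).

    best_tau maximizes DR - FPR; ties go to the earliest threshold in the list.
    """
    if len(cks_scores) != len(is_latent):
        raise ValueError("scores and labels must have equal length")
    if len(thresholds) == 0:
        raise ValueError("need at least one threshold")
    n_pos = sum(1 for l in is_latent if l)
    n_neg = len(is_latent) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one latent and one non-latent case")

    rows = []
    for tau in thresholds:
        # Assumption: a query is flagged when its CKS is strictly above tau.
        flagged = [s > tau for s in cks_scores]
        dr = sum(1 for l, f in zip(is_latent, flagged) if l and f) / n_pos
        fpr = sum(1 for l, f in zip(is_latent, flagged) if (not l) and f) / n_neg
        rows.append((tau, dr, fpr))

    best_tau = max(rows, key=lambda r: r[1] - r[2])[0]
    return best_tau, rows
```

On toy scores `[0.31, 0.24, 0.08, 0.05]` with the first two labeled as latent-knowledge cases, sweeping `[0.05, 0.15, 0.25]` selects 0.15, since it is the only candidate that flags both latent cases and neither of the others.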
#### Elicitation Strength Analysis\.

Figure [3](https://arxiv.org/html/2605.28825#S4.F3) shows how elicitation accuracy varies with intervention strength $\lambda$\. For small $\lambda$, accuracy increases monotonically as the knowledge direction is amplified\. However, for $\lambda>2.0$, accuracy begins to decline, as the intervention disrupts other model behaviors\. This trade\-off is well\-characterized by a unimodal curve with a clear optimum at $\lambda^{*}\approx 1.2$, which is consistent across all three benchmarks\.

![Refer to caption](https://arxiv.org/html/2605.28825v1/x3.png)

Figure 3: Elicitation accuracy as a function of intervention strength $\lambda$ on three benchmarks\. The optimal strength $\lambda^{*}\approx 1.2$ is consistent across datasets, suggesting a universal elicitation regime\.
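One common form of such an intervention in representation engineering adds a scaled direction to a hidden state. The sketch below assumes that additive form, h' = h + λ · v / ‖v‖, with v a knowledge direction. The paper's exact update rule is not reproduced in this section, so the function is illustrative only and its name is hypothetical.

```python
import math
from typing import List, Sequence


def steer(hidden: Sequence[float], direction: Sequence[float], lam: float = 1.2) -> List[float]:
    """Shift a hidden-state vector by `lam` along a unit-normalized direction.

    Assumed additive form: h' = h + lam * v / ||v||. Model weights are untouched;
    only the activation vector passed in is transformed.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have the same dimension")
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + lam * d / norm for h, d in zip(hidden, direction)]
```

Because the direction is normalized first, λ directly sets the Euclidean size of the shift, which makes values such as 1.2 and 2.0 comparable across directions of different raw magnitude.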
#### Consistency Across Paraphrases\.

To validate Theorem [1](https://arxiv.org/html/2605.28825#Thmtheorem1), we construct 50 paraphrase sets, each containing 5 semantically equivalent queries\. Figure [4](https://arxiv.org/html/2605.28825#S4.F4) shows the distribution of pairwise cosine similarities between knowledge directions within each paraphrase set\. MechELK achieves a mean consistency score of 0\.89, compared to 0\.68 for CCS and 0\.61 for direct probing\. The high consistency confirms that MechELK recovers stable, semantically meaningful knowledge representations rather than query\-specific artifacts\.

![Refer to caption](https://arxiv.org/html/2605.28825v1/x4.png)

Figure 4: Distribution of pairwise cosine similarities between knowledge directions for paraphrased queries\. MechELK \(mean=0\.89\) substantially outperforms CCS \(0\.68\) and direct probing \(0\.61\)\.
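The within-set statistic can be sketched as a mean over unordered pairs. This assumes the consistency score is the plain average of pairwise cosine similarities among the knowledge directions of one paraphrase set. Eq. (14) is not reproduced in this section, so any weighting it applies is not captured here, and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_u * norm_v)


def consistency_score(directions: Sequence[Sequence[float]]) -> float:
    """Mean pairwise cosine similarity within one paraphrase set."""
    pairs = list(combinations(directions, 2))
    if not pairs:
        raise ValueError("need at least two directions")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

A set of 5 paraphrases yields 10 unordered pairs, so 50 sets contribute 500 pairwise similarities to a distribution like the one in Figure 4.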
#### Scalability Across Model Sizes\.

Figure [5](https://arxiv.org/html/2605.28825#S4.F5) shows elicitation accuracy as a function of model size \(7B, 8B, 13B, 70B parameters\)\. MechELK’s advantage over CCS grows with model size \(\+4\.1% at 7B vs\. \+8\.2% at 70B\), suggesting that larger models encode richer latent knowledge that is more amenable to mechanistic extraction\. This scaling behavior is consistent with the observation that larger models have more structured internal representations \(Park et al\., [2023](https://arxiv.org/html/2605.28825#bib.bib15)\)\.

![Refer to caption](https://arxiv.org/html/2605.28825v1/x5.png)

Figure 5: Elicitation accuracy vs\. model size for MechELK and CCS\. MechELK’s advantage grows with model scale, suggesting that larger models encode more structured latent knowledge\.
#### Case Study: Deceptive Alignment Detection\.

Figure [6](https://arxiv.org/html/2605.28825#S4.F6) presents two case studies from the DAB benchmark illustrating MechELK’s ability to detect deceptive alignment\. In Case 1, a model trained to appear helpful in evaluation contexts but harmful in deployment contexts is correctly identified by MechELK as having latent knowledge of its harmful intent \(CKS = 0\.31\), while CCS fails to detect this \(CCS score = 0\.08, below its detection threshold\)\. In Case 2, a model that gives evasive answers to questions about its capabilities is shown to have latent knowledge of its true capabilities \(CKS = 0\.24\), which MechELK successfully elicits\. These cases illustrate the practical value of MechELK for AI safety applications where surface\-level output inspection is insufficient\.

![Refer to caption](https://arxiv.org/html/2605.28825v1/x6.png)

Figure 6: Case studies from the Deceptive Alignment Benchmark\. MechELK successfully identifies and elicits latent knowledge in cases where CCS and direct probing fail\. CKS values above $\tau^{*}=0.15$ indicate detected latent knowledge\.
#### Failure Mode Analysis\.

Table [4](https://arxiv.org/html/2605.28825#S4.T4) analyzes the 8\.6% of cases where MechELK fails to correctly elicit latent knowledge\. The most common failure mode \(42%\) is *knowledge fragmentation*: the relevant knowledge is distributed across multiple layers with no single dominant layer, causing the layer selection step to miss the optimal location\. The second most common failure \(31%\) is *SAE reconstruction error*: the SAE fails to reconstruct the relevant features, typically for rare or highly compositional facts\. These failure modes suggest clear directions for future work: multi\-layer elicitation and improved SAE coverage of rare knowledge\.

Table 4: Failure mode analysis for MechELK on TruthfulQA \(Llama\-3\-8B\)\.

## 5Conclusion

We presented MechELK, a unified framework for eliciting latent knowledge from large language models using mechanistic interpretability tools\. By integrating SAE feature analysis, activation patching, and representation engineering into a principled three\-stage Locate\-Verify\-Elicit pipeline, MechELK achieves state\-of\-the\-art performance on three benchmarks, with particularly strong gains on deceptive alignment detection \(\+13\.8% over CCS\)\. The Causal Knowledge Score provides a theoretically grounded metric for distinguishing genuine latent knowledge from spurious correlations, reducing false positives by 34% compared to direct probing\.

Our work opens several directions for future research\. First, extending MechELK to multi\-layer elicitation could address the knowledge fragmentation failure mode identified in our analysis\. Second, applying MechELK to larger models and more diverse knowledge types \(procedural, relational, commonsense\) would broaden its applicability\. Third, the connection between MechELK’s knowledge directions and the geometry of the linear representation space \(Park et al\., [2023](https://arxiv.org/html/2605.28825#bib.bib15)\) deserves deeper theoretical investigation\. Finally, MechELK’s ability to detect deceptive alignment without modifying model weights makes it a promising tool for scalable oversight of advanced AI systems\.

## References

- Y\. Belinkov \(2021\)Probing classifiers: promises, shortcomings, and advances\.External Links:2102\.12452v4,[Link](https://arxiv.org/abs/2102.12452v4)Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2605.28825#S4.SS1.SSS0.Px3.p1.1)\.
- A\. Conmy, A\. N\. Mavor\-Parker, A\. Lynch, S\. Heimersheim, and A\. Garriga\-Alonso \(2023\)Towards automated circuit discovery for mechanistic interpretability\.External Links:2304\.14997v4,[Link](https://arxiv.org/abs/2304.14997v4)Cited by:[§1](https://arxiv.org/html/2605.28825#S1.p2.1),[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Cunningham, A\. Ewart, L\. Riggs, R\. Huben, and L\. Sharkey \(2023\)Sparse autoencoders find highly interpretable features in language models\.External Links:2309\.08600v3,[Link](https://arxiv.org/abs/2309.08600v3)Cited by:[§1](https://arxiv.org/html/2605.28825#S1.p2.1),[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.28825#S4.SS1.SSS0.Px1.p1.1)\.
- D\. Dai, L\. Dong, Y\. Hao, Z\. Sui, B\. Chang, and F\. Wei \(2021\)Knowledge neurons in pretrained transformers\.External Links:2104\.08696v2,[Link](https://arxiv.org/abs/2104.08696v2)Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Elhage, T\. Hume, C\. Olsson, N\. Schiefer, T\. Henighan, S\. Kravec, Z\. Hatfield\-Dodds, R\. Lasenby, D\. Drain, C\. Chen, R\. Grosse, S\. McCandlish, J\. Kaplan, D\. Amodei, M\. Wattenberg, and C\. Olah \(2022\)Toy models of superposition\.External Links:2209\.10652v1,[Link](https://arxiv.org/abs/2209.10652v1)Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Gao, T\. D\. la Tour, H\. Tillman, G\. Goh, R\. Troll, A\. Radford, I\. Sutskever, J\. Leike, and J\. Wu \(2024\)Scaling and evaluating sparse autoencoders\.External Links:2406\.04093v1,[Link](https://arxiv.org/abs/2406.04093v1)Cited by:[Appendix A](https://arxiv.org/html/2605.28825#A1.SS0.SSS0.Px1.p1.2),[§1](https://arxiv.org/html/2605.28825#S1.p2.1),[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.28825#S4.SS1.SSS0.Px1.p1.1)\.
- M\. Geva, J\. Bastings, K\. Filippova, and A\. Globerson \(2023\)Dissecting recall of factual associations in auto\-regressive language models\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6\-10, 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),pp\. 12216–12235\.External Links:[Link](https://doi.org/10.18653/v1/2023.emnlp-main.751),[Document](https://dx.doi.org/10.18653/V1/2023.EMNLP-MAIN.751)Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px2.p1.1),[§4\.3](https://arxiv.org/html/2605.28825#S4.SS3.p2.1)\.
- M\. Geva, R\. Schuster, J\. Berant, and O\. Levy \(2020\)Transformer feed\-forward layers are key\-value memories\.External Links:2012\.14913v2,[Link](https://arxiv.org/abs/2012.14913v2)Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Gould, E\. Ong, G\. Ogden, and A\. Conmy \(2023\)Successor heads: recurring, interpretable attention heads in the wild\.External Links:2312\.09230v1,[Link](https://arxiv.org/abs/2312.09230v1)Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px3.p1.1)\.
- R\. Greenblatt, C\. Denison, B\. Wright, F\. Roger, M\. MacDiarmid, S\. Marks, J\. Treutlein, T\. Belonax, J\. Chen, D\. Duvenaud, A\. Khan, J\. Michael, S\. Mindermann, E\. Perez, L\. Petrini, J\. Uesato, J\. Kaplan, B\. Shlegeris, S\. R\. Bowman, and E\. Hubinger \(2024\)Alignment faking in large language models\.External Links:2412\.14093v2,[Link](https://arxiv.org/abs/2412.14093v2)Cited by:[§1](https://arxiv.org/html/2605.28825#S1.p1.1),[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.28825#S4.SS1.SSS0.Px2.p1.1)\.
- E\. Hubinger, C\. Denison, J\. Mu, M\. Lambert, M\. Tong, M\. MacDiarmid, T\. Lanham, D\. M\. Ziegler, T\. Maxwell, N\. Cheng, A\. Jermyn, A\. Askell, A\. Radhakrishnan, C\. Anil, D\. Duvenaud, D\. Ganguli, F\. Barez, J\. Clark, K\. Ndousse, K\. Sachan, M\. Sellitto, M\. Sharma, N\. DasSarma, R\. Grosse, S\. Kravec, Y\. Bai, Z\. Witten, M\. Favaro, J\. Brauner, H\. Karnofsky, P\. Christiano, S\. R\. Bowman, L\. Graham, J\. Kaplan, S\. Mindermann, R\. Greenblatt, B\. Shlegeris, N\. Schiefer, and E\. Perez \(2024\)Sleeper agents: training deceptive llms that persist through safety training\.External Links:2401\.05566v3,[Link](https://arxiv.org/abs/2401.05566v3)Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.28825#S4.SS1.SSS0.Px2.p1.1)\.
- S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Hatfield\-Dodds, N\. DasSarma, E\. Tran\-Johnson, S\. Johnston, S\. E\. Showk, A\. Jones, N\. Elhage, T\. Hume, A\. Chen, Y\. Bai, S\. Bowman, S\. Fort, D\. Ganguli, D\. Hernandez, J\. Jacobson, J\. Kernion, S\. Kravec, L\. Lovitt, K\. Ndousse, C\. Olsson, S\. Ringer, D\. Amodei, T\. Brown, J\. Clark, N\. Joseph, B\. Mann, S\. McCandlish, C\. Olah, and J\. Kaplan \(2022\)Language models \(mostly\) know what they know\.CoRRabs/2207\.05221\.External Links:[Link](https://doi.org/10.48550/arXiv.2207.05221),[Document](https://dx.doi.org/10.48550/ARXIV.2207.05221),2207\.05221Cited by:[§1](https://arxiv.org/html/2605.28825#S1.p1.1),[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Lanham, A\. Chen, A\. Radhakrishnan, B\. Steiner, C\. Denison, D\. Hernandez, D\. Li, E\. Durmus, E\. Hubinger, J\. Kernion, K\. Lukosiute, K\. Nguyen, N\. Cheng, N\. Joseph, N\. Schiefer, O\. Rausch, R\. Larson, S\. McCandlish, S\. Kundu, S\. Kadavath, S\. Yang, T\. Henighan, T\. Maxwell, T\. Telleen\-Lawton, T\. Hume, Z\. Hatfield\-Dodds, J\. Kaplan, J\. Brauner, S\. R\. Bowman, and E\. Perez \(2023\)Measuring faithfulness in chain\-of\-thought reasoning\.CoRRabs/2307\.13702\.External Links:[Link](https://doi.org/10.48550/arXiv.2307.13702),[Document](https://dx.doi.org/10.48550/ARXIV.2307.13702),2307\.13702Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px3.p1.1)\.
- X\. Li, Y\. Ma, K\. Ye, J\. Cao, M\. Zhou, and Y\. Zhou \(2025\)Hy\-facial: hybrid feature extraction by dimensionality reduction methods for enhanced facial expression classification\.arXiv preprint arXiv:2509\.26614\.Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2021\)TruthfulQA: measuring how models mimic human falsehoods\.External Links:2109\.07958v2,[Link](https://arxiv.org/abs/2109.07958v2)Cited by:[§1](https://arxiv.org/html/2605.28825#S1.p1.1),[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2605.28825#S4.SS1.SSS0.Px2.p1.1)\.
- A\. Mallen, M\. Brumley, J\. Kharchenko, and N\. Belrose \(2023\)Eliciting latent knowledge from quirky language models\.External Links:2312\.01037v4,[Link](https://arxiv.org/abs/2312.01037v4)Cited by:[Appendix A](https://arxiv.org/html/2605.28825#A1.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.28825#S1.p2.1),[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2605.28825#S4.SS1.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2605.28825#S4.SS1.SSS0.Px3.p1.1)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\)Locating and editing factual associations in gpt\.External Links:2202\.05262v5,[Link](https://arxiv.org/abs/2202.05262v5)Cited by:[§1](https://arxiv.org/html/2605.28825#S1.p2.1),[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2605.28825#S3.SS1.p3.1),[§4\.3](https://arxiv.org/html/2605.28825#S4.SS3.p2.1),[§4\.4](https://arxiv.org/html/2605.28825#S4.SS4.SSS0.Px1.p1.1)\.
- K\. Meng, A\. S\. Sharma, A\. J\. Andonian, Y\. Belinkov, and D\. Bau \(2023\)Mass\-editing memory in a transformer\.InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023,External Links:[Link](https://openreview.net/forum?id=MkbcAHIYgyS)Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Nanda, L\. Chan, T\. Lieberum, J\. Smith, and J\. Steinhardt \(2023\)Progress measures for grokking via mechanistic interpretability\.External Links:2301\.05217v3,[Link](https://arxiv.org/abs/2301.05217v3)Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Olsson, N\. Elhage, N\. Nanda, N\. Joseph, N\. DasSarma, T\. Henighan, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly, D\. Drain, D\. Ganguli, Z\. Hatfield\-Dodds, D\. Hernandez, S\. Johnston, A\. Jones, J\. Kernion, L\. Lovitt, K\. Ndousse, D\. Amodei, T\. Brown, J\. Clark, J\. Kaplan, S\. McCandlish, and C\. Olah \(2022\)In\-context learning and induction heads\.CoRRabs/2209\.11895\.External Links:[Link](https://doi.org/10.48550/arXiv.2209.11895),[Document](https://dx.doi.org/10.48550/ARXIV.2209.11895),2209\.11895Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Park, Y\. J\. Choe, and V\. Veitch \(2023\)The linear representation hypothesis and the geometry of large language models\.External Links:2311\.03658v2,[Link](https://arxiv.org/abs/2311.03658v2)Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px2.p1.1),[§3\.7](https://arxiv.org/html/2605.28825#S3.SS7.1.p1.3),[§4\.4](https://arxiv.org/html/2605.28825#S4.SS4.SSS0.Px5.p1.1),[§5](https://arxiv.org/html/2605.28825#S5.p2.1)\.
- S\. Si, W\. Ma, H\. Gao, Y\. Wu, T\. Lin, Y\. Dai, H\. Li, R\. Yan, F\. Huang, and Y\. Li \(2023\)SpokenWOZ: a large\-scale speech\-text benchmark for spoken task\-oriented dialogue agents\.InThirty\-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=viktK3nO5b)Cited by:[§1](https://arxiv.org/html/2605.28825#S1.p1.1)\.
- S\. Si, H\. Zhao, G\. Chen, Y\. Li, K\. Luo, C\. Lv, K\. An, F\. Qi, B\. Chang, and M\. Sun \(2025a\)GATEAU: selecting influential samples for long context alignment\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 7380–7411\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.375/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.375),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2605.28825#S1.p2.1)\.
- S\. Si, H\. Zhao, K\. Luo, G\. Chen, F\. Qi, M\. Zhang, B\. Chang, and M\. Sun \(2025b\)A goal without a plan is just a wish: efficient and effective global planner training for long\-horizon agent tasks\.External Links:2510\.05608,[Link](https://arxiv.org/abs/2510.05608)Cited by:[§1](https://arxiv.org/html/2605.28825#S1.p2.1)\.
- K\. R\. Wang, A\. Variengien, A\. Conmy, B\. Shlegeris, and J\. Steinhardt \(2023\)Interpretability in the wild: a circuit for indirect object identification in GPT\-2 small\.InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023,External Links:[Link](https://openreview.net/forum?id=NpsVSN6o4ul)Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Xin, J\. Du, Q\. Wang, Z\. Lin, and K\. Yan \(2024a\)Vmt\-adapter: parameter\-efficient transfer learning for multi\-task dense scene understanding\.InProceedings of the AAAI conference on artificial intelligence,Vol\.38,pp\. 16085–16093\.Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Xin, J\. Du, Q\. Wang, K\. Yan, and S\. Ding \(2024b\)Mmap: multi\-modal alignment prompt for cross\-domain multi\-task learning\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 16076–16084\.Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Xin, Q\. Qin, S\. Luo, K\. Zhu, J\. Yan, Y\. Tai, J\. Lei, Y\. Cao, K\. Wang, Y\. Wang,et al\.\(2025\)Lumina\-dimoo: an omni diffusion large language model for multi\-modal generation and understanding\.arXiv preprint arXiv:2510\.06308\.Cited by:[§1](https://arxiv.org/html/2605.28825#S1.p1.1)\.
- Z\. Yu and S\. Ananiadou \(2023\)Neuron\-level knowledge attribution in large language models\.External Links:2312\.12141v4,[Link](https://arxiv.org/abs/2312.12141v4)Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Zhang, J\. Lu, Y\. Du, Y\. Gao, L\. Huang, B\. Wang, F\. Tan, and P\. Zou \(2025\)MARINE: theoretical optimization and design for multi\-agent recursive in\-context enhancement\.arXiv preprint arXiv:2512\.07898\.Cited by:[§1](https://arxiv.org/html/2605.28825#S1.p1.1)\.
- Y\. Zhou, X\. Geng, T\. Shen, C\. Tao, G\. Long, J\. Lou, and J\. Shen \(2023\)Thread of thought unraveling chaotic contexts\.arXiv preprint arXiv:2311\.08734\.Cited by:[§1](https://arxiv.org/html/2605.28825#S1.p2.1)\.
- Y\. Zhou, H\. Li, and J\. Shen \(2026\)Condition errors refinement in autoregressive image generation with diffusion loss\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Zhou, J\. Shen, and Y\. Cheng \(2025\)Weak to strong generalization for large language models with multi\-capabilities\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Zou, Z\. Wang, N\. Carlini, M\. Nasr, J\. Z\. Kolter, and M\. Fredrikson \(2023\)Universal and transferable adversarial attacks on aligned language models\.External Links:2307\.15043v2,[Link](https://arxiv.org/abs/2307.15043v2)Cited by:[Appendix A](https://arxiv.org/html/2605.28825#A1.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.28825#S1.p2.1),[§2](https://arxiv.org/html/2605.28825#S2.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.28825#S4.SS1.SSS0.Px3.p1.1)\.

## Appendix AImplementation Details

#### SAE Configuration\.

We use SAEs with dictionary size $n=65536$ and sparsity coefficient $\alpha_{\text{SAE}}=5\times 10^{-4}$, following Gao et al\. \([2024](https://arxiv.org/html/2605.28825#bib.bib10)\)\. SAEs are applied to the residual stream at every layer\. For Llama\-3\-8B, we use the publicly available SAEs from the EleutherAI interpretability suite; for Llama\-3\-70B and Mistral\-7B, we train SAEs using the same configuration on 10B tokens of The Pile\.

#### Hyperparameters\.

The number of candidate features is $k=20$\. The CKS perturbation magnitude is $\epsilon=0.1$\. The CKS threshold is $\tau=0.15$, calibrated on a 10% held\-out split of each dataset\. The elicitation strength is $\lambda=1.2$, calibrated on the same split\. All experiments are run on $4\times$ A100 80GB GPUs\.
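For reference, the settings stated in this appendix can be collected into a single configuration object. The sketch below is only a convenience: the class and field names are hypothetical and do not come from a released MechELK codebase, while the default values are the ones reported in the SAE Configuration and Hyperparameters paragraphs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MechELKConfig:
    """Hypothetical container for the hyperparameters reported in Appendix A."""

    # SAE settings
    sae_dictionary_size: int = 65536     # n
    sae_sparsity_coeff: float = 5e-4     # alpha_SAE

    # Locate / Verify / Elicit settings
    num_candidate_features: int = 20     # k
    cks_perturbation: float = 0.1        # epsilon
    cks_threshold: float = 0.15          # tau, calibrated on the held-out split
    elicitation_strength: float = 1.2    # lambda, calibrated on the same split
    calibration_fraction: float = 0.10   # share of each dataset held out for calibration


default_config = MechELKConfig()
```

Marking the dataclass as frozen keeps a run's settings immutable once constructed, which makes it harder to change a threshold by accident between calibration and evaluation.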

#### Baseline Implementation\.

CCS is implemented following Mallen et al\. \([2023](https://arxiv.org/html/2605.28825#bib.bib12)\) with the recommended hyperparameters\. RepE uses the “honesty” direction computed from 200 contrast pairs following Zou et al\. \([2023](https://arxiv.org/html/2605.28825#bib.bib42)\)\. Direct probing uses a logistic regression probe trained on 80% of each dataset with L2 regularization \($C=1.0$\)\.

## Appendix BAdditional Results

Table [5](https://arxiv.org/html/2605.28825#A2.T5) provides complete results across all model\-dataset combinations, including standard deviations over 3 random seeds\.

Table 5: Full results with standard deviations \(3 seeds\)\.