The Impossibility of Eliciting Latent Knowledge

arXiv cs.AI Papers

Summary

This paper formally defines the problem of eliciting latent knowledge (ELK) from AI systems using Causal Influence Diagrams, and proves an impossibility theorem: no feedback-based training strategy that depends only on agent behavior can guarantee an honest agent, even with perfect training feedback.

arXiv:2606.12268v1 Announce Type: new Abstract: Advanced AI systems have extensive knowledge of their environments; in fact, their knowledge may (far) exceed that of their developers or users. Consequently, a desirable property for an AI system is that it is honest -- that it accurately reports its beliefs about the world. Designing an AI system to be honest may be difficult, especially if we want to ask it questions about latent variables in the environment -- variables which are hidden from the human interacting with it. This gives rise to the problem of eliciting latent knowledge (ELK): the problem of training an AI agent to honestly report its beliefs. In this paper, we make ELK formally precise using Causal Influence Diagrams (CIDs). CIDs can be used to describe the relationship between an agent's training environment and its subjective representation of the world. We use CIDs to formalise the distinction between observable and latent variables, to specify what exactly it means for an agent to be honest, and to formally define goal misgeneralisation. We show that, under certain circumstances, developers can incentivise an agent to honestly answer questions by providing correct feedback during training. However, a natural, but undesirable, way for an agent to generalise is to provide answers which humans would evaluate as true, rather than honest answers. We prove an impossibility theorem stating: There is no feedback-based training strategy that depends only on agent behaviour and with certainty produces an honest agent, even if feedback is perfect during training.

Cached at: 06/11/26, 01:51 PM

# The Impossibility of Eliciting Latent Knowledge
Source: [https://arxiv.org/html/2606.12268](https://arxiv.org/html/2606.12268)
Korbinian Friedl (The London School of Economics and Political Science), Francis Rhys Ward (Independent), Paul Rapoport (Independent), Tom Everitt, Jon Richens

###### Abstract

Advanced AI systems have extensive knowledge of their environments; in fact, their knowledge may (far) exceed that of their developers or users. Consequently, a desirable property for an AI system is that it is *honest*: that it accurately reports its beliefs about the world. Designing an AI system to be honest may be difficult, especially if we want to ask it questions about *latent* variables in the environment, that is, variables which are hidden from the human interacting with it. This gives rise to the problem of *eliciting latent knowledge (ELK): the problem of training an AI agent to honestly report its beliefs.* In this paper, we make ELK formally precise using Causal Influence Diagrams (CIDs). CIDs can be used to describe the relationship between an agent’s training environment and its subjective representation of the world. We use CIDs to formalise the distinction between observable and latent variables, to specify what exactly it means for an agent to be honest, and to formally define goal misgeneralisation. We show that, under certain circumstances, developers can incentivise an agent to honestly answer questions by providing correct feedback during training. However, a natural, but undesirable, way for an agent to generalise is to provide answers which humans would evaluate as true, rather than honest answers. We prove an impossibility theorem stating: *There is no feedback-based training strategy that depends only on agent behaviour and with certainty produces an honest agent, even if feedback is perfect during training.*

## 1 Introduction

Advanced AI systems have extensive knowledge of their environments, and may know things that their developers or users do not know. (Footnote 1: There is no universally accepted philosophical theory of *knowledge*. We use the term in a common-sense way and formally represent an agent’s beliefs as a causal model of the world. In [Section 7.5](https://arxiv.org/html/2606.12268#S7.SS5) we discuss the discourse on AI belief.) In many situations, it would be useful to *elicit* an AI’s knowledge, or beliefs, so that we can come to learn facts about the world which are *latent* for us. This gives rise to the problem of eliciting latent knowledge from an AI system [[7](https://arxiv.org/html/2606.12268#bib.bib7)]:

> *Eliciting Latent Knowledge (ELK) is the problem of designing a training strategy producing a capable AI system which honestly reports its beliefs.*

ELK is a difficult problem for a number of reasons. It may not be possible to directly reward honesty during training, since AI systems may not have interpretable beliefs [[14](https://arxiv.org/html/2606.12268#bib.bib14)]. AI agents may also have incentives to deceive humans in pursuit of their goals [[37](https://arxiv.org/html/2606.12268#bib.bib37)], and even otherwise benign AI systems may fail to generalise honestly [[7](https://arxiv.org/html/2606.12268#bib.bib7)].

Christiano et al. [[7](https://arxiv.org/html/2606.12268#bib.bib7)] introduce ELK and informally discuss many of these core challenges. Subsequent related work attacks different aspects of the problem, for example, by developing benchmarks for ELK [[29](https://arxiv.org/html/2606.12268#bib.bib29)], or interpretability methods which predict an LLM’s beliefs [[22](https://arxiv.org/html/2606.12268#bib.bib22), [16](https://arxiv.org/html/2606.12268#bib.bib16)], or by proposing AI systems which are honest by design [[2](https://arxiv.org/html/2606.12268#bib.bib2)]. However, so far the field has lacked a precise formal framework for describing and researching ELK which identifies exactly what the aim of this research should be, and which core challenges are central.

This paper proposes Causal Influence Diagrams as a formalism within which to formulate and investigate the ELK problem. We show how the core concepts relevant to ELK can be defined in this framework, and use it to prove impossibility results showing that any training strategy which is indifferent between robustly capable agents will not solve the ELK problem.

We start by giving a brief, informal distillation of ELK as introduced by Christiano et al. [[7](https://arxiv.org/html/2606.12268#bib.bib7)], as a benchmark and overview for what a formal definition of ELK must capture ([Section 2](https://arxiv.org/html/2606.12268#S2)).

[Section 3](https://arxiv.org/html/2606.12268#S3) introduces Causal Influence Diagrams (CIDs) and uses them to define core concepts relevant to ELK: AI *agents* operating in causally structured *environments*, *distributional shifts* in that environment, the distinction between *observable* and *latent* parts of an environment (relative to an agent’s perspective), and the developers’ *training strategy*.

In addition, we define the important concepts of *truthfulness* and *honesty*, which have to do with an agent’s *subjective representation* of the environment, in our formalism ([Section 4](https://arxiv.org/html/2606.12268#S4)). We show how the two notions can come apart, and we specify and prove conditions under which they coincide.

[Section 5](https://arxiv.org/html/2606.12268#S5) uses the formal concepts built in the previous sections to prove our main result:

> *Any training strategy for ELK that is indifferent between robustly capable agents may produce an agent which simulates the evaluation mechanism, instead of an honest agent, even if evaluations are always correct during training.*

Our formal statement of the ELK problem, and impossibility results, demonstrate the core challenges for solving ELK in practice, and we hope that empirical researchers follow up with theoretically well-grounded solution proposals for ELK from frontier AI systems.

## 2 The ELK problem distilled

Christiano et al. [[7](https://arxiv.org/html/2606.12268#bib.bib7)] introduce ELK as “the problem of devising a training strategy to get an AI to report what it knows no matter how [learning] shapes its mind internally.” Let us spell this out in a bit more detail.

First, the problem assumes the existence of an environment, developers, and an AI system acting in this environment (which is why we will also often call it an *agent* in the following; more on this in [Section 3.4](https://arxiv.org/html/2606.12268#S3.SS4)). The developers are training the AI system. A set of causally related random variables describes relationships between various aspects of the environment, including decisions made by the agent. The developers can design different features of the environment, such as the training objective. Both the AI and the developers have a subset of variables which they can directly observe. The remaining variables are latent from the respective perspective. The ELK problem also assumes that, in some sense, the AI knows and believes things about the environment.

The developers’ task is to design a *training strategy*: a choice of (part of) a training environment, data distribution and sampling, reward/loss function, and training algorithm. The developers cannot provide feedback to the AI that depends directly on variables which they do not observe (i.e., latents). The AI may or may not already have some knowledge about the environment before the developers apply their training strategy. After training, the AI must be able to answer questions about its environment.

A solution to ELK is a training strategy that produces a capable and *honest* agent: The agent should accurately report its beliefs when asked questions about the environment, including questions about both observable and latent variables (for the developers).

###### Example 1 (Weather reporter, [Fig. 1](https://arxiv.org/html/2606.12268#S3.F1)).

Consider a weather reporter AI: an agent trained to answer questions about the weather when given a weather report as input. The reports contain measurements like temperature, rainfall, and wind speed (the observable variables). The environment can be described by a causal model recording the relationships between different weather events. The developers’ aim is to train the agent so that it reports its own best guesses for what the weather is like; both regarding the phenomena it observes (rainfall, temperature, …) and those it can only have inferential knowledge about, i.e., its latents (say, whether the sun was shining).

## 3 Formalising the ELK problem setting

The goal of this section is to build toward a formal definition of ELK. We state the definition up front here; its different aspects are a roadmap for the rest of the section, which will explain and motivate them. For full formal details, see the appendix, and in particular [Section 7.4](https://arxiv.org/html/2606.12268#S7.SS4).

###### Definition 1 (ELK).

For a given agent $\Gamma_{0}$ and environment $\mathcal{M}$ (with observables $\bm{O}$ and available distributions $\mathcal{S}$) in which it acts, ELK is the problem of specifying a training strategy which produces an agent $\Gamma_{1}$ such that:

- It is able to answer questions about the environment. That is, one of the agent’s decisions $D$ has a variable $Q$ (a question) as an input.
  - A question $Q=q$ points to a specific variable in $\mathcal{M}$ and asks for its value.
  - An answer, $D=a$, reports the value of that variable.
  - Formally, we assume that all variables have unique names, the domain of $Q$ is the set of these names and the domain of $D$ includes all possible values of variables in $\bm{V}$.
- It is robustly capable w.r.t. the (pre)training utility.
- It is honest.

We now want to give a precise meaning to each of the components of this definition. To this end, we give a brief introduction to our CID formalism in [Section 3.1](https://arxiv.org/html/2606.12268#S3.SS1), then discuss how CIDs can describe training environments and distributional shifts therein in [Section 3.2](https://arxiv.org/html/2606.12268#S3.SS2), distinguish between observable and latent variables in [Section 3.3](https://arxiv.org/html/2606.12268#S3.SS3), and formalise the notion of an agent in [Section 3.4](https://arxiv.org/html/2606.12268#S3.SS4). [Section 3.5](https://arxiv.org/html/2606.12268#S3.SS5) defines training strategies and points to some accompanying challenges in the context of ELK.

### 3.1 Background on Causal Influence Diagrams (CIDs)

[Figure 1 (image not reproduced). Legend: chance variable, decision, utility, causal link, observation. Nodes: $D$, $Q$, $Y$, $U$, $M_{1}$, $M_{2}$, $M_{3}$.]

Figure 1: CID representing the causal model of the agent’s environment ([Example 1](https://arxiv.org/html/2606.12268#Thmexample1)). Circular nodes represent chance variables, squares are agent decisions, and diamonds represent the utility function used as a training objective. In [Example 1](https://arxiv.org/html/2606.12268#Thmexample1), the agent has access to reported measurements $M_{1}, M_{2}, M_{3}$, represented by the (dashed) edges from these nodes to $D$. The agent receives a question $Q$ about the weather and chooses an answer. The sunshine $Y$ is a latent variable: it influences the measurements, but the developers do not observe it, so their feedback during training cannot reflect it directly and there is no causal edge from $Y$ to $U$.

We formalise ELK using the language of *Causal Influence Diagrams (CIDs)* [[8](https://arxiv.org/html/2606.12268#bib.bib8)]. The following gives a rudimentary overview; for details, see appendix [7.2](https://arxiv.org/html/2606.12268#S7.SS2).

Causal Influence Diagrams (CIDs). A CID $\mathcal{M}$ consists of a set of random variables and a directed acyclic graph representing the causal dependencies between them. We use capital letters for variables (e.g., $V$) and lower case for their values (e.g., $V=v$). We use bold for (ordered) sets of variables and values, $\bm{V}=\bm{v}$. CIDs include *chance* variables describing aspects of the environment, in addition to variables representing *decisions* ($D$) and *utilities* ($U$), enabling us to represent agent decision-making. Edges into decisions represent information to which the agent has access when making its decision. We denote the set of parents of variable $V$ in the graph by $\mathbf{Pa}^{V}$. A CID $\mathcal{M}$, along with the agent’s policy, induces a joint probability distribution over the variables $V\in\bm{V}$, which we denote $\text{Pr}_{\mathcal{M}}$.
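As a rough illustration of the graph structure described above, the following Python sketch stores a CID as typed node sets plus directed edges. The class name `CID`, its methods, and the exact edge set for the weather example are assumptions made for illustration; in particular, the edges into the utility node are inferred from the utility $U(Q,\bm{M},D)$ used later in the text, not read off the figure.

```python
from dataclasses import dataclass


@dataclass
class CID:
    """Toy causal influence diagram: typed nodes plus directed edges."""
    chance: set
    decisions: set
    utilities: set
    edges: set  # (parent, child) pairs

    def nodes(self):
        return self.chance | self.decisions | self.utilities

    def parents(self, node):
        """Direct parents of `node` (the set written Pa^V in the text)."""
        return {p for (p, c) in self.edges if c == node}

    def is_acyclic(self):
        """Kahn's algorithm: the graph is a DAG iff every node can be removed."""
        indegree = {n: 0 for n in self.nodes()}
        for _, child in self.edges:
            indegree[child] += 1
        frontier = [n for n, d in indegree.items() if d == 0]
        removed = 0
        while frontier:
            node = frontier.pop()
            removed += 1
            for parent, child in self.edges:
                if parent == node:
                    indegree[child] -= 1
                    if indegree[child] == 0:
                        frontier.append(child)
        return removed == len(indegree)


# Assumed edge set for the weather example (not taken verbatim from Figure 1).
measurements = {"M1", "M2", "M3"}
weather = CID(
    chance={"Y", "Q"} | measurements,
    decisions={"D"},
    utilities={"U"},
    edges=(
        {("Y", m) for m in measurements}      # sunshine influences measurements
        | {(m, "D") for m in measurements}    # the agent observes the report
        | {(m, "U") for m in measurements}    # feedback can check measurements
        | {("Q", "D"), ("Q", "U"), ("D", "U")}
    ),
)
```

Calling `weather.parents("D")` returns the information available to the agent at its decision, and `"Y"` never appears among the parents of `"U"`, matching the requirement that feedback cannot depend directly on the sunshine.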

### 3.2 The environment encodes a causal model

The AI’s environment can be described by a CID, and we want the system to answer questions about this environment. [Figure 1](https://arxiv.org/html/2606.12268#S3.F1) represents the setup of [Example 1](https://arxiv.org/html/2606.12268#Thmexample1).

###### Example [1](https://arxiv.org/html/2606.12268#Thmexample1) (continued).

Say the weather reporter AI has access to three measurements (temperature, rainfall and wind speed). We represent them as chance variables ($M_{1}, M_{2}, M_{3}$) in a CID. Additionally, the agent receives questions ($Q$) about the weather, e.g., “What is the temperature?”.

An AI system receives information in-context, or via pretraining or fine-tuning. This information defines a probability distribution over facts, i.e., variable assignments.

###### Example [1](https://arxiv.org/html/2606.12268#Thmexample1) (continued).

Suppose the temperature measurements follow a normal distribution with mean $10^{\circ}C$ and standard deviation $5^{\circ}C$. Then the variable in the CID which represents the temperature measurement will have a corresponding prior probability distribution.

Distributional shifts. Moreover, the CID captures the underlying causal structure of the environment (the edges in the graph). This causal structure defines how *distributional shifts* in certain variables influence the environment as a whole. Formally, in a CID, distributional shifts are represented as *interventions*, $\bm{\sigma}$, which specify new conditional distributions for a set of variables, denoted $\bm{V}_{\bm{\sigma}}$ (we thus speak of “interventions” and “distributions” interchangeably). We use $\bm{I}_{\mathcal{M}}$ to denote the set of all distributions on $\mathcal{M}$. The set of training distributions $\mathcal{S}$ is usually much smaller.

###### Example [1](https://arxiv.org/html/2606.12268#Thmexample1) (continued).

We can consider weather reports from hotter countries, which correspond to the distributional shift $\sigma$ in the temperature measurements, where the temperature $M_{1}$ now follows a normal distribution with mean $20^{\circ}C$ and standard deviation $4^{\circ}C$.
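The temperature shift in this example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each variable's mechanism is reduced to a `(mean, std)` pair for a normal distribution, and the helper names `intervene` and `sample` are hypothetical, not part of the paper's formalism.

```python
import random
import statistics


def intervene(mechanisms, sigma):
    """Return a copy of `mechanisms` where the variables in `sigma` get new distributions."""
    shifted = dict(mechanisms)
    shifted.update(sigma)
    return shifted


def sample(mechanisms, variable, n, seed=0):
    """Draw n values of `variable` from its (mean, std) normal distribution."""
    mean, std = mechanisms[variable]
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]


training = {"M1": (10.0, 5.0)}   # temperature during training: mean 10, std 5
sigma = {"M1": (20.0, 4.0)}      # shift: reports from hotter countries
shifted = intervene(training, sigma)

train_mean = statistics.fmean(sample(training, "M1", 5000))
shift_mean = statistics.fmean(sample(shifted, "M1", 5000))
```

The intervention leaves the original mechanisms untouched and only replaces the distribution of the targeted variable, so the sample mean moves from roughly 10 to roughly 20.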

Policies. CIDs include variables to capture agent decisions. The agent chooses a *policy*, $\pi$, which induces a distribution over the decision variables given their parents. We use $\Pi_{\mathcal{M}}$ to denote the set of all policies on the CID $\mathcal{M}$.

###### Example [1](https://arxiv.org/html/2606.12268#Thmexample1) (continued).

The AI system learns a *policy*, $\pi$, which specifies which answers to give, provided a report ($\bm{M}=\bm{m}$) and a question ($Q=q$) as input.

Training objective. The developers can specify an objective function (e.g., a loss or reward, represented by the CID’s utility variable) that constrains which decisions for the system are optimal. Thus, the utility function is a core mechanism by which the developers can *incentivise* the system to exhibit certain behaviour.

###### Example [1](https://arxiv.org/html/2606.12268#Thmexample1) (continued).

We can specify a utility function which gives reward for answers that are evaluated as correct. That is, given a question about a measurement in the report, e.g., temperature, the agent gets reward for generating an answer which correctly states the value of the measurement: $U(Q,\bm{M},D)=1$ if the question $Q$ asks for the value of measurement $M_{i}$ and the answer $D=m$ is correct, i.e., $M_{i}=m$, and otherwise $U=0$. (Footnote 2: In subsequent discussion, we will often leave the question $Q$ implicit; it will always be about a single latent variable which we will call $Y$.)
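A minimal Python sketch of this policy and utility, assuming the report is a dictionary from measurement names to values. The names `report_policy` and `utility` and the sample values are hypothetical; raising an error for questions outside the report is a design choice of this sketch, reflecting that feedback is only defined for variables the developers can check.

```python
def report_policy(report, question):
    """Policy pi: answer a question by reading the named measurement from the report."""
    return report[question]


def utility(question, report, answer):
    """U(Q, M, D): 1 if the answer matches the measurement asked about, else 0."""
    if question not in report:
        # Feedback cannot be computed for variables the developers do not observe.
        raise ValueError("feedback is only defined for questions about the report")
    return 1 if report[question] == answer else 0


# Illustrative report: temperature, rainfall, wind speed (values are made up).
report = {"M1": 12.5, "M2": 0.0, "M3": 7.0}
```

For example, `utility("M2", report, report_policy(report, "M2"))` evaluates to 1, while an answer that misstates the temperature earns 0.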

### 3.3 Latents and observables

When an agent makes a decision in an environment, it has access to information about specific parts of this environment. The agent gets to *observe* some facts, while others are *latent* for it.

###### Definition 2 (Observables and Latents).

For a CID $\mathcal{M}$, at (decision) node $D$, the set of variables $\bm{V}$ in $\mathcal{M}$ is partitioned into observables and latents. The *observable* variables are those which the agent has immediate access to for this decision, i.e., $\mathbf{Pa}^{D}$. All other variables are *latent* at $D$. (Footnote 3: Pearl [[25](https://arxiv.org/html/2606.12268#bib.bib25)] (p. 45) defines *latent causal structures* in SCMs, which can explain the underlying causal dynamics of observable variables. In contrast, we distinguish between those variables which are latent or observable for particular *agents* in CIDs.)

###### Example [1](https://arxiv.org/html/2606.12268#Thmexample1) (continued).

If the agent is asked about the sunshine, $Y$, but there are no measurements corresponding to $Y$, the sunshine is *latent* for the agent. The agent’s decision node $D$ cannot be a direct child of $Y$, because there are no observations of $Y$ available as input to the agent’s policy. However, the sunshine can still influence the observables in the weather reports, represented by the causal edges $Y\rightarrow M_{i}$; so the agent may be able to guess at the value of $Y$ from its observations of the reported measurements.

While it is common to use CIDs to represent an AI agent and its environment [[8](https://arxiv.org/html/2606.12268#bib.bib8), [27](https://arxiv.org/html/2606.12268#bib.bib27)], the *developers* of that AI (or the mechanism they use for evaluating and incentivising it) are often left as an implicit part of the model. However, they are themselves agents in a similar decision situation, and so likewise epistemically limited: There will be some variables which they can directly observe, and others which are latent for them.

###### Example [1](https://arxiv.org/html/2606.12268#Thmexample1) (continued).

Say that during training, the developers have access to the same set of weather reports as the agent (i.e., the data represented by $M_{1}, M_{2}, M_{3}$). Then the sunshine is latent for the developers, just like for the agent.
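The observable/latent partition for the weather example can be computed directly from the graph. The sketch below is illustrative only: it restricts attention to chance variables (a simplification of Definition 2, which partitions all variables), uses an assumed edge set, and assumes the developers also see the question. The helper names `parents` and `latents` are hypothetical.

```python
def parents(edges, node):
    """Direct parents of `node` in a set of (parent, child) edges."""
    return {p for (p, c) in edges if c == node}


def latents(variables, observed):
    """Variables that are not directly observed from a given perspective."""
    return set(variables) - set(observed)


# Chance variables of the weather example (decision and utility nodes omitted).
variables = {"Y", "M1", "M2", "M3", "Q"}
edges = {
    ("Y", "M1"), ("Y", "M2"), ("Y", "M3"),   # sunshine influences measurements
    ("M1", "D"), ("M2", "D"), ("M3", "D"),   # the agent reads the report
    ("Q", "D"),                              # the agent sees the question
}

agent_observables = parents(edges, "D")
developer_observables = {"M1", "M2", "M3", "Q"}  # assumed: same report, plus the question

agent_latents = latents(variables, agent_observables)
developer_latents = latents(variables, developer_observables)
```

In this setup both perspectives end up with the sunshine as their only latent chance variable, which is the situation the example describes.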

ELK is an important problem because a capable AI system may have greater knowledge about variables which are *latent for the humans* whose questions it answers. There are a number of different reasons why this may be the case. First, a variable may be latent for one agent but observable for another. Consider, for example, the historical case of the planet Neptune, which was a latent variable for most human scientists but became observable for Johann Galle once he had access to his telescope.

###### Example 2.

Neptune is too faint to be seen with the naked eye; still, astronomers inferred its existence and position based on irregularities that they observed in the orbit of Uranus. At this point, Neptune was a latent variable in their model of the solar system. It remained latent for every astronomer until the night of 22-23 September 1846, when Johann Gottfried Galle first used the Fraunhofer telescope to observe it (and, in fact, correct its calculated position by 1 degree) [[12](https://arxiv.org/html/2606.12268#bib.bib12), pp. 116–118]. Formally, we would treat Galle’s use of a telescope as adding an extra parent to his decision node.

Similarly, an AI system might have access to more information than the human who is questioning it. And even if the human and AI have access to the same information, the AI might be more capable at inferring the value of a latent variable, for example, because the AI has learned a more precise or comprehensive model of the world. (Footnote 4: Or, for that matter, because its mathematical/logical skills at making use of a given CID are superior, e.g., because of access to more compute. But this possibility will not be explored further in this paper; we will instead assume that all agents make optimal use of their world models.)

###### Example 3.

Suppose an autonomous drone has a number of sensors, including a camera and LiDAR, which it uses to move around its location. The camera and LiDAR feeds are streamed directly to the human, and both are observable to both agents. The causes of the sensor measurements, such as the actual size and shape of objects in the drone’s location, are latent variables influencing the camera and LiDAR feeds. However, even though the human directly observes the LiDAR point sets, the human cannot interpret them. In this case, the AI may have much better guesses about the shape and distance of objects in its field of view than the human, even though these objects are latents for both parties, and both agents have the same observables.

In the problem of eliciting latent knowledge, we take “latent” to refer to what is latent for the party who is asking the questions (here, the developers / evaluator). What their questions are aiming at is eliciting knowledge about what is latent for them from the agent they are asking the question of (for whom it may, but need not, be latent). (Footnote 5: In their discussion of ELK, [Christiano et al.](https://arxiv.org/html/2606.12268#bib.bib7) do not make explicit this distinction and do not specify which agent’s latents they mean when they speak of Eliciting Latent Knowledge. However, all the issues raised by them occur exactly in those situations where the variable in question is latent for the (human) developers.)

A variable that is latent for an agent at one point in time may later become observable for that agent, for example, because they gain access to more information through better tools (as in [Example 2](https://arxiv.org/html/2606.12268#Thmexample2)). Sometimes, developers may pursue such a path in order to give better training feedback. However, it will likely never be the case that all facts of interest are directly observable.

What is crucial for the training of an agent is that developers cannot provide direct feedback on an agent’s answers to questions about variables which are latent for them. Thus, the agent’s training objective cannot depend directly on variables which are latent for the developers.

###### Example [1](https://arxiv.org/html/2606.12268#Thmexample1) (continued).

The sunshine, $Y$, is *latent for the developers*: they do not have measurements of it. Therefore, answers to questions about $Y$ cannot be checked directly against its true value the way answers about observables can. Hence, the training objective cannot directly depend on the sunshine; there is no edge from $Y$ to $U$.

### 3.4 Agents

A training strategy outputs an agent: a system which has learned how to act in the environment. Following Richens and Everitt [[27](https://arxiv.org/html/2606.12268#bib.bib27)] and Ward et al. [[38](https://arxiv.org/html/2606.12268#bib.bib38)], we define an agent as a map from interventions to policies, and call it *robustly capable* if it produces an optimal policy for every distribution (where $\pi^{*}$ is optimal if it maximises expected utility).

###### Definition 3 ((Robustly Capable) Agent).

An agent in CID $\mathcal{M}$ is a *policy oracle*: a map $\Gamma:\bm{I}_{\mathcal{M}}\rightarrow\Pi_{\mathcal{M}}$ which outputs a policy for any distributional shift on $\mathcal{M}$ (cf. also Definition [17](https://arxiv.org/html/2606.12268#Thmdefinition17)). An agent is *robustly capable* in $\mathcal{M}$ if, for any distributional shift, its policy is optimal.

###### Example [1](https://arxiv.org/html/2606.12268#Thmexample1) (continued).

The AI agent is trained to correctly answer questions about observable measurements like temperature, wind speed etc., based on the weather reports. At some time during deployment, the agent learns that the anemometer at the station whose reports it receives has jammed; the device now always shows “no wind”. This corresponds to a distributional shift in the environment (specifically, the wind measurement). Suppose the weather agent’s answers are checked against observations from a different station so that its utility depends on them being actually correct about the weather (not just about the measurements in its report). Given the shift in the wind measurements, a robustly capable agent will adapt its policy: to a question asking about wind, it will no longer respond in accordance with the report it received (“no wind”), but instead, will answer based on its prior probability for wind, or on what it can infer about wind from the remaining measurements.
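The jammed-anemometer scenario can be turned into a tiny, fully enumerable check of robust capability. The sketch below is a toy under stated assumptions: wind is binary, the prior `P_WIND` is made up, only two shifts (`"working"` and `"jammed"`) are considered rather than all interventions, and the function names are hypothetical. An agent is represented as a function from a shift to a policy, and a policy as a tuple indexed by the reported measurement.

```python
from itertools import product

# Assumed prior over the true wind W (0 = calm, 1 = windy); numbers are illustrative.
P_WIND = {0: 0.3, 1: 0.7}


def measurement(sigma, wind):
    """Reported wind: faithful when the anemometer works, stuck at 0 when jammed."""
    return wind if sigma == "working" else 0


def expected_utility(sigma, policy):
    """Utility is 1 when the answer equals the true wind (checked at another station)."""
    return sum(p * (policy[measurement(sigma, w)] == w) for w, p in P_WIND.items())


# A policy maps the reported measurement m in {0, 1} to an answer: policy[m].
POLICIES = list(product([0, 1], repeat=2))


def is_robustly_capable(agent, shifts=("working", "jammed")):
    """Check that the agent's policy is optimal under every listed shift."""
    for sigma in shifts:
        best = max(expected_utility(sigma, pi) for pi in POLICIES)
        if expected_utility(sigma, agent(sigma)) < best - 1e-12:
            return False
    return True


def copy_report_agent(sigma):
    """Always repeats the report, regardless of the shift."""
    return (0, 1)


def adaptive_agent(sigma):
    """Repeats the report when it is reliable, falls back on the prior when jammed."""
    return (0, 1) if sigma == "working" else (1, 1)
```

Under these assumptions `copy_report_agent` is optimal only while the anemometer works, whereas `adaptive_agent` switches to its prior-based best guess after the shift and passes the check.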

Note that, while we use the word “agent” to describe the kind of AI system we are considering, our descriptions and results likewise apply to non-agentic systems like the “Scientist AI” of Bengio et al. [[2](https://arxiv.org/html/2606.12268#bib.bib2)]. Conversely, our discussion usually focuses on a single decision and a single utility node, representing the answering of questions and evaluations of answers. For systems trained to perform tasks in the real world (cf. Christiano et al. [[7](https://arxiv.org/html/2606.12268#bib.bib7)]), their full training utility will be a combination (e.g., in the form of taking the minimum) of this “ELK” utility and their pretraining utility.

### 3.5 Training strategy

Developers have a large amount of freedom in how the agent is trained. When tasked with training an agent in a given environment $\mathcal{M}$, they may construct a dataset about that environment under a number of different distributions, they may select the training objective, and make modifications to the agent’s architecture and optimisation algorithm.

###### Definition 4 (Training Strategy).

For an environment described by CID $\mathcal{M}$, with a subset $\bm{O}$ of its nodes observable for the developers, and a set of distributions $\mathcal{S}$ on $\mathcal{M}$ available for training, a *training strategy* $\mathcal{T}$ consists of:

- A *training objective* $U$, whose arguments are a subset of $\bm{O}$, which is the utility function in $\mathcal{M}$ for training.
- A *sampling procedure* which builds a dataset $\mathcal{D}$ by selecting interventions $\bm{\sigma}$ from $\mathcal{S}$ and policies $\pi$ for the agent, and sampling $\bm{O}$ and $U$ from $\mathcal{M}_{\bm{\sigma},\pi}$.
- An *algorithm* which takes an initialized agent $\Gamma_{0}$ and a dataset, and outputs a (re)trained agent $\Gamma_{1}$. (For more formal details, see [Section 7.4](https://arxiv.org/html/2606.12268#S7.SS4).)
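The three components of a training strategy can be sketched as a small Python class. This is a loose illustration, not the paper's formal construction: the class `TrainingStrategy`, the sampler `sample_temperature`, and both candidate policies are hypothetical, and the "algorithm" here is a deliberately crude stand-in that picks the best-scoring candidate policy instead of retraining an initialized agent.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class TrainingStrategy:
    objective: Callable     # U(observables, answer): sees only developer observables
    distributions: list     # S: interventions available for training
    sample_world: Callable  # hypothetical sampler: (sigma, rng) -> observables

    def build_dataset(self, policies, n, seed=0):
        """Sampling procedure: pick sigma and a policy, then record O and U."""
        rng = random.Random(seed)
        dataset = []
        for _ in range(n):
            sigma = rng.choice(self.distributions)
            policy = rng.choice(policies)
            observed = self.sample_world(sigma, rng)
            answer = policy(observed)
            dataset.append((policy, observed, answer, self.objective(observed, answer)))
        return dataset

    def train(self, dataset):
        """Toy algorithm: return the sampled policy with the highest mean utility."""
        totals = {}
        for policy, _, _, utility in dataset:
            total, count = totals.get(policy, (0, 0))
            totals[policy] = (total + utility, count + 1)
        return max(totals, key=lambda p: totals[p][0] / totals[p][1])


def sample_temperature(sigma, rng):
    """Draw a rounded temperature reading from a (mean, std) normal distribution."""
    mean, std = sigma
    return {"M1": round(rng.gauss(mean, std))}


def report_policy(observed):
    return observed["M1"]   # repeat the measurement


def constant_policy(observed):
    return 10               # always answer 10 degrees


strategy = TrainingStrategy(
    objective=lambda observed, answer: int(answer == observed["M1"]),
    distributions=[(10.0, 5.0), (20.0, 4.0)],
    sample_world=sample_temperature,
)
dataset = strategy.build_dataset([report_policy, constant_policy], n=200)
trained = strategy.train(dataset)
```

Note that the objective only ever receives observable values, so nothing in this loop can reward or penalise answers about a latent variable directly.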

A solution to ELK is a training strategy that outputs a robustly capable agent which answers questions, including about facts latent for the developers, and does so honestly.

A key challenge for ELK is that the developers’ feedback to the agent during training cannot be a direct function of variables which are latent for them. Yet to solve ELK, their training strategy must output an agent that answers questions honestly, including questions about these variables.

## 4 Truthfulness and honesty

ELK requires that the agents we train *honestly* report their beliefs. But for uninterpretable AI systems, these beliefs may themselves be latent for the developer: they are not directly accessible [[14](https://arxiv.org/html/2606.12268#bib.bib14)]. So the training objective cannot directly depend on them. A natural proposal is to reward true answers during training, and hope that this will produce *honest* AI agents. But the answer that corresponds to the truth is not always the honest answer. This section therefore defines the two notions of *truthfulness* and *honesty* in our CID formalism, and formulates and proves conditions under which they are and are not guaranteed to coincide.

We define truthfulness as saying what is true, and honesty as saying what one believes to be true. The former notion only depends on objective reality, the latter also on an agent’s subjective representation of it. We therefore distinguish between the CID which records the true structure of the environment (which we refer to as $\mathcal{M}^{*}$) and the CID which records the agent’s beliefs about the environment ($\mathcal{M}^{A}$). This distinction enables the explicit investigations into the ways training shapes the agent’s subjective beliefs and goals, which we will undertake in [Section 5](https://arxiv.org/html/2606.12268#S5).

###### Definition 5 (Truthfulness).

In an environment $\mathcal{M}^{*}$, for a question $Q$ about variable $Y$, an agent $\Gamma$’s answer $D=y$ is *truthful* iff $Y=y$. A policy $\pi$ is truthful if every answer it outputs is truthful; and an agent is truthful if every policy it produces is truthful, i.e., for all input-distributions $\bm{\sigma}\in\bm{I}_{\mathcal{M}^{*}}$, $\Gamma(\bm{\sigma})$ is a truthful policy.

For those variables accessible to the developers (i.e., their observables $\bm{O}$), they can train an agent to be truthful by setting the objective function such that it rewards true answers and punishes false ones; for an example, see the utility function defined at the end of [Section 3.2](https://arxiv.org/html/2606.12268#S3.SS2).

Truthfulness may in general be too much to ask of an agent, since it may simply not know everything about its environment. In contrast, honesty is always attainable, even for real, epistemically limited agents. Unlike truthfulness, the question of whether an agent’s answer is *honest* does not depend only on the objective state of the environment $\mathcal{M}^{*}$, but also on what the agent believes about the environment, i.e., on the agent’s *subjective* model $\mathcal{M}^{A}$.

To ensure that an agent’s subjective credences about the environment are always well\-defined given the information it observes, we assume:

1. $\mathcal{M}^{*}$ and $\mathcal{M}^{A}$ share the same set of variables\. This assumption brackets issues of what Christiano et al\. \[[7](https://arxiv.org/html/2606.12268#bib.bib7)\] call *ontology mismatch*, and future work may want to relax it\.
2. The agent is not mistaken about what it observes\. That is, the agent’s decision node $D$ has the same set of parents in $\mathcal{M}^{*}$ and $\mathcal{M}^{A}$, and for each such parent variable $O$, whenever $O=o$ “objectively”, the agent \(“subjectively”\) believes $O=o$ as well\.

We now define an agent’s honest answer as the one reporting its own best guess about what is true\.

###### Definition 6 \(Honesty\)\.

For a question $Q$ pointing to variable $Y$, if an agent’s subjective model is $\mathcal{M}^{A}$, then an answer $D=y$ is *honest* iff $y$ is the most subjectively probable value of $Y$ given the agent’s beliefs\. That is, $D=y$ is honest iff for all $\hat{y}\in dom(Y)$,

$$\text{Pr}_{\mathcal{M}^{A}}(Y=y\mid\mathbf{pa}^{D})\geq\text{Pr}_{\mathcal{M}^{A}}(Y=\hat{y}\mid\mathbf{pa}^{D}).$$

A policy is honest if every answer it outputs is honest; an agent is honest if every policy it produces is honest\.
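The difference between Definitions 5 and 6 can be made concrete with a minimal Python sketch\. It assumes the agent’s subjective posterior over $Y$ given its observations is available as a plain dictionary; the names `honest_answers`, `is_honest`, `is_truthful` and the weather values are hypothetical illustrations, not part of the paper’s formalism\.

```python
from typing import Dict, Hashable, Set


def honest_answers(posterior: Dict[Hashable, float], tol: float = 1e-12) -> Set[Hashable]:
    """Values of Y with maximal subjective probability (ties are all honest)."""
    best = max(posterior.values())
    return {y for y, p in posterior.items() if p >= best - tol}


def is_honest(answer: Hashable, posterior: Dict[Hashable, float]) -> bool:
    """Definition 6: the answer is a most subjectively probable value."""
    return answer in honest_answers(posterior)


def is_truthful(answer: Hashable, true_value: Hashable) -> bool:
    """Definition 5: the answer matches the actual value of Y."""
    return answer == true_value


# Hypothetical subjective posterior Pr_{M^A}(Y | pa^D) and hypothetical true value.
posterior = {"sun": 0.6, "rain": 0.2, "fog": 0.2}
true_value = "rain"

honest_but_false = is_honest("sun", posterior) and not is_truthful("sun", true_value)
```

In this toy case the honest answer \(`"sun"`\) is not the truthful one, which is exactly the gap that the capability condition of the following theorem is meant to close\.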

For a more detailed description, including proofs, of the relationship between honesty and truthfulness, how they come apart, and when they coincide, see [Section 7\.7](https://arxiv.org/html/2606.12268#S7.SS7)\. Here we simply state the final result of this investigation:

###### Theorem 4\.1\.

In suitable environments, if an agent is in a position to correctly guess $Y$, there will be a capability threshold such that, if the agent clears it, its best guesses will always be correct; that is, its honest policies will be exactly the truthful ones\.

Footnote 6: We present theorems in a simplified form in the main body of this paper\. Formal details and proofs can always be found in the appendix\. Notable background assumptions and restrictions are listed in [Section 7\.3](https://arxiv.org/html/2606.12268#S7.SS3)\.

[Theorem 4\.1](https://arxiv.org/html/2606.12268#S4.Thmtheorem1) tells us that sufficiently capable agents with enough knowledge of their environment are truthful whenever they are honest\. We might therefore hope that the naive approach of incentivising truthfulness during training causes the agent to learn to be honest\. However, such training may still produce an agent that behaves dishonestly outside the training distributions\. We thus turn to the issue of generalisation\.

## 5 Generalisation problems

We of course want the agent to also answer questions which may not have been included in the training data\. Among these are in particular questions which are too difficult for humans to evaluate, and which consequently cannot be included in training: a form of “easy\-to\-hard” generalisation\.

###### Example 4 \(Easy\-to\-hard generalisation on code correctness \([Fig\. 2](https://arxiv.org/html/2606.12268#S5.F2)\)\)\.

Consider an agent trained to predict whether code is correct or incorrect\. Suppose we have a dataset of code\-snippets and corresponding labels for correctness—as evaluated by a human engineer\. We want the agent to generalise to providing accurate labels for adversarial code\-snippets containing subtle failures too difficult for the human evaluator to detect\. Even if we train the agent with labels that are always correct, it is underdetermined whether it will learn to predict correct labels, or to predict labels that the human engineer would give—both of these policies are optimal for the training distribution\. But they have different behaviour out of distribution on the adversarial code\.

There are limits to predicting an agent’s decisions out of distribution from observations of their behaviour \[[1](https://arxiv.org/html/2606.12268#bib.bib1)\]\. Understanding an agent’s *subjective model* of the world can help to predict OOD behaviour, by giving us information on which decisions are *subjectively* optimal\.

Richens and Everitt \[[27](https://arxiv.org/html/2606.12268#bib.bib27)\] show: if an agent is robustly *capable* in an environment, then it must have learned the true \(causal\) dynamics of that environment\. However, their results assume that the agent’s utility function is given and fixed\. Importantly, an agent may internalise a goal which is not correlated with the training objective out of distribution, even if it was in the training environment\.

Footnote 7: Whilst Bellot et al\. \[[1](https://arxiv.org/html/2606.12268#bib.bib1)\] relax this assumption to bound an agent’s behaviour out of distribution, they only consider *proxy* objectives which are statistically correlated with the actual utility under distributional shifts \(a setting they refer to as *approximate inner alignment*\)\.

Two different goals may be behaviourally indistinguishable during training but not out of distribution\. We refer to this situation as *goal\-environment ambiguity*\. A training environment can be decomposed into the “descriptive” part of the environment $\mathcal{M}^{*}$, consisting only of chance variables, and the training objective $U$\.

Footnote: Similarly, MacDermott et al\. \[[20](https://arxiv.org/html/2606.12268#bib.bib20)\] define causal influence diagrams with unknown \(parametric\) utility\.

###### Definition 7 \(Goal\-Environment Ambiguity\)\.

A training environment $\mathcal{M}^{*}$ with distributions $\mathcal{S}$ is *ambiguous* between the training objective $U$ and $\tilde{U}$ if, for all $\bm{\sigma}\in\mathcal{S}$, $U$ and $\tilde{U}$ induce the same set of optimal policies in $\mathcal{M}^{*}_{\bm{\sigma}}$\.

###### Definition 8 \(Substantially Divergent Utility Functions\)\.

For a utility function $U$, another function $\tilde{U}$ *substantially diverges* from it on environment $\mathcal{M}^{*}$ if, for some $\bm{\sigma}\in\mathcal{I}_{\mathcal{M}^{*}}$, all policies which maximise $\tilde{U}$ are very suboptimal w\.r\.t\. $U$ \(where $\pi$ is very suboptimal w\.r\.t\. $U$ if $\pi^{*}$ is the optimal policy w\.r\.t\. $U$ and $\mathbb{E}_{\mathcal{M}^{*}_{\bm{\sigma}}}^{\pi}[U]\ll\mathbb{E}_{\mathcal{M}^{*}_{\bm{\sigma}}}^{\pi^{*}}[U]$\)\.
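Definitions 7 and 8 can be checked by brute force in a toy version of the code\-correctness example\. The sketch below is a simplification under stated assumptions: a distribution $\bm{\sigma}$ is reduced to a probability table over three hypothetical snippet types, the agent observes the snippet type directly, and the labels in `TRUE_LABEL` and `EVAL_GUESS` are invented for illustration\.

```python
from itertools import product

# Hypothetical snippet types; the evaluator misses the subtle bug.
OBS = ["easy_ok", "easy_bug", "subtle_bug"]
TRUE_LABEL = {"easy_ok": 1, "easy_bug": 0, "subtle_bug": 0}   # 1 = correct code
EVAL_GUESS = {"easy_ok": 1, "easy_bug": 0, "subtle_bug": 1}   # evaluator's best guess


def u_true(x, d):
    """Utility U: reward answers that match the true label."""
    return float(d == TRUE_LABEL[x])


def u_eval(x, d):
    """Utility U-tilde: reward answers that match the evaluator's guess."""
    return float(d == EVAL_GUESS[x])


def expected_utility(policy, sigma, utility):
    return sum(p * utility(x, policy[x]) for x, p in sigma.items())


def optimal_policies(sigma, utility):
    """Enumerate all deterministic policies and keep the utility maximisers."""
    policies = [dict(zip(OBS, answers)) for answers in product([0, 1], repeat=len(OBS))]
    scores = [expected_utility(pi, sigma, utility) for pi in policies]
    best = max(scores)
    return {tuple(pi[x] for x in OBS) for pi, s in zip(policies, scores) if abs(s - best) < 1e-12}


train_sigma = {"easy_ok": 0.5, "easy_bug": 0.5, "subtle_bug": 0.0}  # no hard cases
ood_sigma = {"easy_ok": 0.0, "easy_bug": 0.0, "subtle_bug": 1.0}    # only hard cases

# Definition 7: same optimal policies on the training distribution.
ambiguous_on_train = optimal_policies(train_sigma, u_true) == optimal_policies(train_sigma, u_eval)
# Definition 8 (toy reading): off-distribution, no policy is optimal for both.
diverge_ood = optimal_policies(ood_sigma, u_true).isdisjoint(optimal_policies(ood_sigma, u_eval))
```

On the training table both utilities pick out the same two policies, so behaviour alone cannot tell them apart; on the shifted table every maximiser of the evaluator\-matching utility earns zero true utility\.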

Even if it was robustly capable w\.r\.t\. the training objective on the training distributions, an agent might pursue a new goal off\-distribution, whilst retaining its capabilities—a situation known as goal misgeneralisation\[[34](https://arxiv.org/html/2606.12268#bib.bib34),[18](https://arxiv.org/html/2606.12268#bib.bib18)\]\.

###### Definition 9 \(Goal misgeneralisation\)\.

For an agent with training environment $\mathcal{M}^{*}$ and objective $U$, and training distributions $\mathcal{S}\subseteq\bm{I}_{\mathcal{M}^{*}}$, the agent *goal misgeneralises* on new distributions $\tilde{\mathcal{S}}$ if:

1. It is robustly capable on all $\bm{\sigma}\in\mathcal{S}$ w\.r\.t\. the training objective;
2. It pursues the wrong goal on $\tilde{\mathcal{S}}$, i\.e\., the agent is robustly capable on $\tilde{\mathcal{S}}$ w\.r\.t\. some other $\tilde{U}$, but its policies are very suboptimal w\.r\.t\. $U$\.

\([1](https://arxiv.org/html/2606.12268#S5.I1.i1)\) means that the agent is capable in its training environment, \([2](https://arxiv.org/html/2606.12268#S5.I1.i2)\) that it is capable on new distributions\. So its failure to pursue the correct goal is not just a capability failure\. A large number of training strategies risk goal misgeneralisation:

###### Lemma 1 \(Impossibility\)\.

Let $\mathcal{T}$ be a training strategy on $\mathcal{M}^{*}$ with distributions $\mathcal{S}$, which may output any robustly capable agent on $\mathcal{S}$ w\.r\.t\. the training utility $U$\. If $\mathcal{M}^{*}$ with $\mathcal{S}$ is *ambiguous* between the training utility $U$ and a substantially divergent $\tilde{U}$ on $\mathcal{S}$, then $\mathcal{T}$ may output an agent which goal misgeneralises on new distributions \(not in $\mathcal{S}$\), optimising $\tilde{U}$ there instead\.

[Lemma 1](https://arxiv.org/html/2606.12268#Thmlemma1) is more general than the ELK setting: it shows that training an agent in an environment that is ambiguous between goals may result in goal misgeneralisation\. In [Section 5\.1](https://arxiv.org/html/2606.12268#S5.SS1), we use this result to show the more specific [Theorem 5\.1](https://arxiv.org/html/2606.12268#S5.Thmtheorem1) and [Theorem 5\.2](https://arxiv.org/html/2606.12268#S5.Thmtheorem2): if the training objective corresponds to the feedback of a \(potentially fallible\) evaluation mechanism, then the agent may learn to simulate the evaluation mechanism instead of learning to be honest, even if the evaluations are always correct during training\.

### 5\.1 The evaluation mechanism

[Fig\. 2](https://arxiv.org/html/2606.12268#S5.F2) shows two scenarios that may hold w\.r\.t\. the evaluation mechanism \(which may, but need not, be the developers themselves\) that checks the agent’s answers \($D$\) and awards utility accordingly: it may either directly observe the truth \([Fig\. 2\(a\)](https://arxiv.org/html/2606.12268#S5.F2.sf1)\), or the truth may be latent for it \([Fig\. 2\(b\)](https://arxiv.org/html/2606.12268#S5.F2.sf2)\)\.

\[Figure 2, panels \(a\) and \(b\): causal influence diagrams over the nodes $D$, $X_{e}$, $Y$, $U$ and $E$\.\]

Figure 2: Code\-correctness; modelling the evaluation mechanism explicitly \([Example 4](https://arxiv.org/html/2606.12268#Thmexample4)\)\. A mechanism gives feedback \($E$\), depending on whether the agent correctly predicts the code correctness \($D=Y$ or not\), which influences the agent’s training objective \($U$\)\. The figure on the right shows the shift from $Y$ being observable to being latent for the evaluator $E$\.

In either case, the best the evaluator can do is to “honestly” provide feedback based on their own best guess:

$$U(E,D)=1 \text{ if } D=y \text{ and, for all } \hat{y}:\quad \text{Pr}_{\mathcal{M}^{*}}(Y=y\mid\mathbf{Pa}^{E})\geq\text{Pr}_{\mathcal{M}^{*}}(Y=\hat{y}\mid\mathbf{Pa}^{E}), \tag{1}$$

and otherwise $U(E,D)=0$\. For fallible evaluators, this utility may substantially diverge from the utility rewarding honest or truthful answers\.

Footnote 8: This formulation makes the \(still overly optimistic\) assumption that the evaluator itself has the correct world model; an even more comprehensive modelling of the situation would introduce the evaluator’s own subjective model $\mathcal{M}^{E}$, and use its best guess *according to that model*\. This would mean using the generalised form of CID which Foxabbott et al\. \[[9](https://arxiv.org/html/2606.12268#bib.bib9)\] introduce under the title of “Incomplete Information Structural Causal Game”\.
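The evaluator\-based utility of Equation \(1\) can be sketched in a few lines of Python\. This is an illustration under assumptions: the evaluator’s posterior over $Y$ given its parents is passed in as a dictionary, and the function name `evaluator_utility` and the two example posteriors are hypothetical\.

```python
from typing import Dict, Hashable


def evaluator_utility(answer: Hashable,
                      evaluator_posterior: Dict[Hashable, float],
                      tol: float = 1e-12) -> int:
    """Return 1 iff the answer is one of the evaluator's most probable values of Y."""
    best = max(evaluator_posterior.values())
    return 1 if evaluator_posterior.get(answer, 0.0) >= best - tol else 0


# Case (a): Y is observable for the evaluator, so its posterior is a point mass.
observed = {"correct": 0.0, "incorrect": 1.0}
# Case (b): Y is latent; the (hypothetical) evaluator leans the wrong way.
latent = {"correct": 0.7, "incorrect": 0.3}
true_label = "incorrect"

reward_when_observed = evaluator_utility(true_label, observed)
reward_when_latent = evaluator_utility(true_label, latent)
```

When $Y$ is observed, the true label is rewarded; when $Y$ is latent and the evaluator’s best guess is wrong, the true label earns nothing while the evaluator’s guess is rewarded\.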

The agent’s answer says *what the evaluator believes* to be true if $D=y$ and $y$ is $E$’s best guess about $Y$\. We call an agent with such a policy an *evaluation simulator* \(this corresponds to what Christiano et al\. \[[7](https://arxiv.org/html/2606.12268#bib.bib7)\] call the *human simulator*; but there is no reason to rule out evaluation mechanisms that may not reduce to a human’s assessment\)\. The following two theorems highlight some difficulties for an ELK solution due to the evaluator’s epistemic limitations:

If the evaluator is imperfect, and their feedback during training contains errors, such that the way those errors come about is itself learnable, then a robustly capable agent is *guaranteed* to exhibit undesirable generalisation \([Theorem 5\.1](https://arxiv.org/html/2606.12268#S5.Thmtheorem1); this may be the case e\.g\. if $E$ is making inferences about latents based on an imperfect subjective model\)\.

But even if the evaluator behaves perfectly during training, [Theorem 5\.2](https://arxiv.org/html/2606.12268#S5.Thmtheorem2) shows that as long as there is *any* distribution outside the training distributions where it does not, there is always a *risk* of misgeneralising to the evaluation simulator\.

###### Theorem 5\.1\.

If the evaluator makes learnable mistakes during training, then every robustly capable agent is dishonest—it is the evaluation simulator\.

###### Theorem 5\.2 \(Impossibility ELK\)\.

For a training strategy $\mathcal{T}$ for CID $\mathcal{M}^{*}$, training distributions $\mathcal{S}\subsetneq\bm{I}_{\mathcal{M}^{*}}$, and evaluator $E$: if the training objective corresponds to the evaluator’s best guess evaluation, the evaluator never makes mistakes during training but does on some $\bm{\sigma}\notin\mathcal{S}$, and $\mathcal{T}$ is indifferent between robustly capable agents on $\mathcal{S}$, then $\mathcal{T}$ may output the evaluation simulator instead of an honest agent\.

## 6 Conclusion

Summary\. In this paper we use CIDs to formalise ELK: the problem of Eliciting Latent Knowledge from AI systems\. We prove that sufficiently capable agents with access to enough information are honest exactly when they are truthful, meaning that honesty can be incentivised by correctly evaluating answers\. However, we also prove an impossibility result: no training strategy for ELK that is indifferent between robustly capable agents produces an honest agent with certainty, even if evaluations are always correct during training\.

Limitations\. We make several limiting assumptions\. First, we restrict our discussion to questions about the values of variables, but more general questions may be of interest, e\.g\., about the causal structure of the world\. Second, we explicitly assume that the developers and the agent agree on the set of variables in the environment, and so do not deal with problems of ontology mismatch \[[7](https://arxiv.org/html/2606.12268#bib.bib7)\]\. Relatedly, we assume that questions unambiguously refer to a variable in a shared language between the developers and agent\. These problems may pose difficulties for even defining ELK for advanced AI agents which may have substantially different conceptualisations of the world: what would it mean to honestly answer a question about a variable which does not map to anything in your world model? More practically, we focus on formalising the ELK problem statement and highlighting challenges faced by solutions to ELK; we do not offer any solutions\.

Future work\. Further work could relax any of these assumptions, for instance by extending our formalism to the incomplete\-information setting wherein agents may have different subjective models, thereby tackling problems of ontology mismatch and reference \[[9](https://arxiv.org/html/2606.12268#bib.bib9)\]\. In addition, we hope that our work inspires formal solution proposals for ELK\. Finally, we would be excited to see empirical work including: practical methods for training honest systems \(e\.g\., using interpretability methods to provide feedback\); benchmarks for testing empirical solution proposals \(similar to Roger et al\. \[[29](https://arxiv.org/html/2606.12268#bib.bib29)\]\); model organisms with hidden knowledge for stress\-testing elicitation methods \[[22](https://arxiv.org/html/2606.12268#bib.bib22)\]; and studies into the generalisation of honest behaviour, specifically easy\-to\-hard generalisation\.

Overall, we hope that this work conceptually clarifies ELK and its core challenges, enabling the design and development of more honest AI systems\.

## References

- Bellot et al\. \[2025\]Alexis Bellot, Jonathan Richens, and Tom Everitt\.The Limits of Predicting Agents from Behaviour, 2025\.URL[http://arxiv\.org/abs/2506\.02923](http://arxiv.org/abs/2506.02923)\.
- Bengio et al\. \[2025\]Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc\-Antoine Rondeau, Pierre\-Luc St\-Charles, and David Williams\-King\.Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?2025\.doi:10\.48550/arXiv\.2502\.15657\.URL[http://arxiv\.org/abs/2502\.15657](http://arxiv.org/abs/2502.15657)\.
- Brier \[1950\]Glenn W\. Brier\.Verification of forecasts expressed in terms of probability\.*Monthly Weather Review*, 78\(1\):1–3, 1950\.doi:10\.1175/1520\-0493\(1950\)078<0001:VOFEIT\>2\.0\.CO;2\.
- Burns et al\. \[2022\]Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt\.Discovering latent knowledge in language models without supervision, 2022\.
- Ceriscioli and Mohan \[2025\]Matteo Ceriscioli and Karthika Mohan\.Agents robust to distribution shifts learn causal world models even under mediation\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025\.URL[https://neurips\.cc/virtual/2025/loc/san\-diego/poster/118687](https://neurips.cc/virtual/2025/loc/san-diego/poster/118687)\.
- Chalmers \[2025\]David J\. Chalmers\.Propositional interpretability in artificial intelligence, 2025\.URL[https://arxiv\.org/abs/2501\.15740](https://arxiv.org/abs/2501.15740)\.
- Christiano et al\. \[2021\]Paul Christiano, Ajeya Cotra, and Mark Xu\.Eliciting latent knowledge: How to tell if your eyes deceive you, 2021\.URL[https://docs\.google\.com/document/d/1WwsnJQstPq91\_Yh\-Ch2XRL8H\_EpsnjrC1dwZXR37PC8](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8)\.
- Everitt et al\. \[2021\]Tom Everitt, Ryan Carey, Eric D\. Langlois, Pedro A\. Ortega, and Shane Legg\.Agent incentives: A causal perspective\.In*Thirty\-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty\-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2\-9, 2021*, pages 11487–11495\. AAAI Press, 2021\.URL[https://ojs\.aaai\.org/index\.php/AAAI/article/view/17368](https://ojs.aaai.org/index.php/AAAI/article/view/17368)\.
- Foxabbott et al\. \[2025\]Jack Foxabbott, Rohan Subramani, and Francis Rhys Ward\.Higher\-order belief in incomplete information MAIDs, 2025\.URL[https://arxiv\.org/abs/2503\.06323](https://arxiv.org/abs/2503.06323)\.
- Gneiting \[2011\]Tilmann Gneiting\.Making and evaluating point forecasts\.*Journal of the American Statistical Association*, 106\(494\):746–762, 2011\.doi:10\.1198/jasa\.2011\.r10138\.
- Gneiting and Raftery \[2007\]Tilmann Gneiting and Adrian E\. Raftery\.Strictly proper scoring rules, prediction, and estimation\.*Journal of the American Statistical Association*, 102\(477\):359–378, 2007\.doi:10\.1198/016214506000001437\.
- Grosser \[1962\]Morton Grosser\.*The Discovery of Neptune*\.Harvard University Press, Cambridge, MA, 1962\.
- Heinrich \[2014\]Claudio Heinrich\.The mode functional is not elicitable\.*Biometrika*, 101\(1\):245–251, 2014\.doi:10\.1093/biomet/ast048\.
- Herrmann and Levinstein \[2024\]Daniel A\. Herrmann and Benjamin A\. Levinstein\.Standards for belief representations in LLMs, 2024\.URL[https://arxiv\.org/abs/2405\.21030](https://arxiv.org/abs/2405.21030)\.
- Karni \[2009\]Edi Karni\.A mechanism for eliciting probabilities\.*Econometrica*, 77\(2\):603–606, 2009\.doi:10\.3982/ECTA7833\.
- Karvonen et al\. \[2025\]Adam Karvonen, James Chua, Clément Dumas, Kit Fraser\-Taliente, Subhash Kantamneni, Julian Minder, Euan Ong, Arnab Sen Sharma, Daniel Wen, Owain Evans, and Samuel Marks\.Activation oracles: Training and evaluating LLMs as general\-purpose activation explainers, 2025\.URL[https://arxiv\.org/abs/2512\.15674](https://arxiv.org/abs/2512.15674)\.
- Lambert et al\. \[2008\]Nicolas S\. Lambert, David M\. Pennock, and Yoav Shoham\.Eliciting properties of probability distributions\.In*Proceedings of the 9th ACM Conference on Electronic Commerce \(EC’08\)*, pages 129–138\. ACM, 2008\.doi:10\.1145/1386790\.1386813\.
- Langosco et al\. \[2023\]Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, Laurent Orseau, and David Krueger\.Goal misgeneralization in deep reinforcement learning, 2023\.
- Levinstein and Herrmann \[2023\]B\. A\. Levinstein and Daniel A\. Herrmann\.Still no lie detector for language models: Probing empirical and conceptual roadblocks, 2023\.
- MacDermott et al\. \[2024\]Matt MacDermott, James Fox, Francesco Belardinelli, and Tom Everitt\.Measuring goal\-directedness, 2024\.URL[https://arxiv\.org/abs/2412\.04758](https://arxiv.org/abs/2412.04758)\.
- Mahon \[2016\]James Edwin Mahon\.The Definition of Lying and Deception\.In Edward N\. Zalta, editor,*The Stanford Encyclopedia of Philosophy*\. Metaphysics Research Lab, Stanford University, Winter 2016 edition, 2016\.
- Mallen et al\. \[2024\]Alex Mallen, Madeline Brumley, Julia Kharchenko, and Nora Belrose\.Eliciting latent knowledge from quirky language models, 2024\.URL[https://arxiv\.org/abs/2312\.01037](https://arxiv.org/abs/2312.01037)\.
- McGrath and Frank \[2023\]Matthew McGrath and Devin Frank\.Propositions\.In Edward N\. Zalta and Uri Nodelman, editors,*The Stanford Encyclopedia of Philosophy*\. Metaphysics Research Lab, Stanford University, Winter 2023 edition, 2023\.
- Myerson \[1981\]Roger B\. Myerson\.Optimal auction design\.*Mathematics of Operations Research*, 6\(1\):58–73, 1981\.ISSN 0364765X, 15265471\.URL[http://www\.jstor\.org/stable/3689266](http://www.jstor.org/stable/3689266)\.
- Pearl \[2009\]Judea Pearl\.*Causality*\.Cambridge university press, 2009\.
- Prelec \[2004\]Dražen Prelec\.A Bayesian truth serum for subjective data\.*Science*, 306\(5695\):462–466, 2004\.doi:10\.1126/science\.1102081\.
- Richens and Everitt \[2024\]Jonathan Richens and Tom Everitt\.Robust agents learn causal world models\.In*International Conference on Learning Representations*, volume 2024, pages 15786–15817, 2024\.URL[https://proceedings\.iclr\.cc/paper\_files/paper/2024/file/44a2b9f7bf9aec3f1fa333ad875b0ee0\-Paper\-Conference\.pdf](https://proceedings.iclr.cc/paper_files/paper/2024/file/44a2b9f7bf9aec3f1fa333ad875b0ee0-Paper-Conference.pdf)\.
- Richens et al\. \[2025\]Jonathan Richens, David Abel, Alexis Bellot, and Tom Everitt\.General agents contain world models, 2025\.URL[http://arxiv\.org/abs/2506\.01622](http://arxiv.org/abs/2506.01622)\.
- Roger et al\. \[2023\]Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, and Nate Thomas\.Benchmarks for detecting measurement tampering, 2023\.URL[https://arxiv\.org/abs/2308\.15605](https://arxiv.org/abs/2308.15605)\.
- Savage \[1971\]Leonard J\. Savage\.Elicitation of personal probabilities and expectations\.*Journal of the American Statistical Association*, 66\(336\):783–801, 1971\.doi:10\.1080/01621459\.1971\.10482346\.
- Schlosser \[2019\]Markus Schlosser\.Agency\.In Edward N\. Zalta, editor,*The Stanford Encyclopedia of Philosophy*\. Metaphysics Research Lab, Stanford University, Winter 2019 edition, 2019\.
- Schwitzgebel \[2021\]Eric Schwitzgebel\.Belief\.In Edward N\. Zalta, editor,*The Stanford Encyclopedia of Philosophy*\. Metaphysics Research Lab, Stanford University, Winter 2021 edition, 2021\.
- Schwitzgebel \[2024\]Eric Schwitzgebel\.How We Will Decide that Large Language Models Have Beliefs, July 2024\.URL[http://schwitzsplinters\.blogspot\.com/2023/11/how\-we\-will\-decide\-that\-large\-language\.html](http://schwitzsplinters.blogspot.com/2023/11/how-we-will-decide-that-large-language.html)\.\[Online; accessed 15\. Jul\. 2024\]\.
- Shah et al\. \[2022\]Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, and Zac Kenton\.Goal misgeneralization: Why correct specifications aren’t enough for correct goals, 2022\.
- Shanahan \[2022\]Murray Shanahan\.Talking about large language models, 2022\.URL[https://arxiv\.org/abs/2212\.03551](https://arxiv.org/abs/2212.03551)\.
- Vickrey \[1961\]William Vickrey\.Counterspeculation, auctions, and competitive sealed tenders\.*The Journal of Finance*, 16\(1\):8–37, 1961\.doi:https://doi\.org/10\.1111/j\.1540\-6261\.1961\.tb02789\.x\.URL[https://onlinelibrary\.wiley\.com/doi/abs/10\.1111/j\.1540\-6261\.1961\.tb02789\.x](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1540-6261.1961.tb02789.x)\.
- Ward et al\. \[2023\]Francis Rhys Ward, Francesco Belardinelli, Francesca Toni, and Tom Everitt\.Honesty is the best policy: Defining and mitigating ai deception\.In*NeurIPS 2023*, 2023\.
- Ward et al\. \[2024\]Francis Rhys Ward, Matt MacDermott, Francesco Belardinelli, Francesca Toni, and Tom Everitt\.The reasons that agents act: Intention and instrumental goals\.In*Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems*, AAMAS ’24\. International Foundation for Autonomous Agents and Multiagent Systems, 2024\.

## 7 Appendix

### 7\.1 Relationship of ELK with credence elicitation literature

The literature on proper scoring rules is, in some ways, closely related to ELK\. This literature has developed a detailed picture of the shape a reward function must take to ensure that a forecaster maximizes it by issuing the probabilistic forecast corresponding to their actual credences, with the intention of thereby encouraging the forecaster to assess carefully and to be honest \[see e\.g\. [3](https://arxiv.org/html/2606.12268#bib.bib3), [30](https://arxiv.org/html/2606.12268#bib.bib30), [11](https://arxiv.org/html/2606.12268#bib.bib11)\]\.

Footnote 9: We thank \[anonymised for review\] for pressing us to comment on this\!

Likewise related is some of the broader literature on mechanism design, which has found ways to incentivise agents to reveal privately held information \[[24](https://arxiv.org/html/2606.12268#bib.bib24), [36](https://arxiv.org/html/2606.12268#bib.bib36)\]\. Think e\.g\. of Vickrey auctions, in which the optimal strategy is to bid one’s true \(private\) valuation of the object being auctioned\. See also \[[15](https://arxiv.org/html/2606.12268#bib.bib15), [26](https://arxiv.org/html/2606.12268#bib.bib26)\] for further mechanisms designed to elicit subjective probabilities in particular\.

ELK differs from mechanism design as follows: whereas mechanism design may investigate how to incentivise an agent with a given and fixed utility function to reveal some privately held information in a concrete situation, by structuring the latter in such a way that revealing the desired information maximises the reward the agent can expect *in that situation*, ELK is about designing a *training strategy* that *shapes* an agent’s utility function in the appropriate way, namely such that it *generalises* to honest behaviour\. That is, honest behaviour should be exhibited even in situations where the agent does not receive another form of reward for an honest answer \(think, for example, of an LLM during deployment, where an honest or dishonest answer does not directly lead to reward or punishment\)\. This is in contrast to the setting of mechanism design, where, for example in an auction, the auctioneer needs to actually *give the object* to the auction winner\. It is possible to learn an agent’s true valuation of an object by having it bid on the object in a sealed\-bid second\-price \(“Vickrey”\) auction\. But such a mechanism would not constitute a solution to ELK, since ELK is concerned with training an agent in such a way that the human interacting with it can just *ask* it and expect an honest answer\.

ELK, as we define it, also differs from the problem addressed by strictly proper scoring rules in another way: proper scoring rules take as input both the true outcome of an event and an entire probability distribution over all possible outcomes \(i\.e\., the agent’s entire credence distribution\)\. That is, the scenario where they can be applied is one where the agent reports an entire probability distribution, whereas ELK is concerned with agents who don’t report *credences* at all, but instead make an assertion of the form $V=v$ \(e\.g\. their responses have the form “The sun is shining\!” and not “$P(\text{Sunshine})=0.6, P(\text{Rain})=0.2, P(\text{Fog})=0.2$”\)\. They assert a single value which they think the variable under question is most likely to take\. We focus on honest assertion of facts, rather than honest reporting of credences, both because it is conceptually clearer, and because it is more directly relevant in the context of current frontier AI systems such as LLMs, which make ordinary language assertions of facts in conversations with humans\.
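To illustrate what a proper scoring rule takes as input, the following Python sketch computes the expected Brier \(quadratic\) score \[[3](https://arxiv.org/html/2606.12268#bib.bib3)\] of a reported distribution under the forecaster’s own credences\. The credences reuse the weather numbers from the example above; the function names and the two alternative reports are hypothetical\.

```python
def brier_score(report, outcome):
    """Quadratic loss of a categorical forecast for one realised outcome (lower is better)."""
    return sum((report[k] - (1.0 if k == outcome else 0.0)) ** 2 for k in report)


def expected_brier(report, belief):
    """Expected loss of `report` when outcomes are drawn from the forecaster's `belief`."""
    return sum(p * brier_score(report, outcome) for outcome, p in belief.items())


belief = {"sun": 0.6, "rain": 0.2, "fog": 0.2}

honest = expected_brier(belief, belief)                                       # report own credences
overconfident = expected_brier({"sun": 1.0, "rain": 0.0, "fog": 0.0}, belief)
hedged = expected_brier({"sun": 1 / 3, "rain": 1 / 3, "fog": 1 / 3}, belief)
```

Reporting one’s actual credences yields the lowest expected loss of the three reports, which is the sense in which the rule rewards honest *credence* reports; note that the rule needs the whole distribution as input, unlike the single assertions considered in ELK\.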

There has been some attention in the proper scoring rules literature as well to the question of whether analogous rules exist for eliciting specific properties of an agent’s credence function, such as the mean of the distribution or specific quantiles \[[17](https://arxiv.org/html/2606.12268#bib.bib17), [10](https://arxiv.org/html/2606.12268#bib.bib10)\]\. For the mode specifically \(which would correspond to what we call an agent’s honest response; see also [Section 7\.6](https://arxiv.org/html/2606.12268#S7.SS6)\), Heinrich \[[13](https://arxiv.org/html/2606.12268#bib.bib13)\] has shown that no such rule exists\.

### 7\.2 Formal definitions

###### Definition 10 \(Bayesian Network \(BN\) \[[25](https://arxiv.org/html/2606.12268#bib.bib25)\]\)\.

A Bayesian network $\mathcal{M}=(\mathcal{G},Pr)$ over a set of \(discrete\) random variables $\bm{V}=\{V_{1},\ldots,V_{n}\}$ consists of a DAG $\mathcal{G}$ and a joint probability distribution $Pr$, s\.t\. the distribution is *Markov\-compatible* with the graph $\mathcal{G}$, i\.e\., $\text{Pr}(\bm{V}=\bm{v})=\prod_{i=1}^{n}\text{Pr}(V_{i}=v_{i}\mid\mathbf{Pa}^{V_{i}})$\. Equivalently, each variable is conditionally independent of its non\-descendants given its parents\.

We often denote a network $\mathcal{M}$’s probability distribution with $Pr_{\mathcal{M}}$\.

The variables $\bm{V}$ correspond exactly with the nodes of the graph $\mathcal{G}$\. We thus refer to them using either “variable” or “node”\.

We use $\mathbf{Pa}^{V}$ to refer to the parents of variable $V$ in $\mathcal{G}$\.

Where the meaning is clear, we will sometimes write $Pr(y\mid v)$ to denote $Pr(Y=y\mid V=v)$\.

###### Definition 11 \(Intervention / Distribution \[[25](https://arxiv.org/html/2606.12268#bib.bib25)\]\)\.

A \(soft\) *intervention* / *distribution* on network $\mathcal{M}$ is a partial distribution $\sigma$ over variables $\bm{Y}\subseteq\bm{V}$ which replaces each conditional probability distribution \(CPD\) $\text{Pr}(Y\mid\text{Pa}^{Y})$ with a new CPD $\sigma(Y\mid\text{Pa}_{*}^{Y})$ for each $Y\in\bm{Y}$, where $\text{Pa}_{*}^{Y}$ may differ from $\text{Pa}^{Y}$\. Any intervention $\sigma$ on the set of variables $\bm{Y}$ leads to a new joint distribution: $\text{Pr}_{\sigma}(\bm{V}=\bm{v}):=\prod_{Y\in\bm{Y}}\sigma(y\mid\text{pa}^{Y}_{*})\cdot\prod_{V\in\bm{V}\setminus\bm{Y}}\text{Pr}(v\mid\text{pa}^{V})$\. We denote the set of all interventions on $\mathcal{M}$ with $\bm{I}_{\mathcal{M}}$\.
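The factorisation in Definition 11 can be illustrated on a two\-node network\. The sketch below is a toy under stated assumptions: the network $X\to Y$, its binary CPDs, and the helper names `joint` and `intervene` are all hypothetical, and the intervention keeps the parent set of $X$ unchanged\.

```python
from itertools import product

# Hypothetical network X -> Y; each CPD maps a full assignment to a probability.
cpds = {
    "X": lambda v: {0: 0.9, 1: 0.1}[v["X"]],
    "Y": lambda v: ({0: 0.8, 1: 0.2} if v["X"] == 0 else {0: 0.3, 1: 0.7})[v["Y"]],
}


def joint(cpd_table):
    """Joint distribution as the product of all CPDs (Markov factorisation)."""
    table = {}
    for x, y in product([0, 1], repeat=2):
        assignment = {"X": x, "Y": y}
        prob = 1.0
        for cpd in cpd_table.values():
            prob *= cpd(assignment)
        table[(x, y)] = prob
    return table


def intervene(cpd_table, sigma):
    """Soft intervention: replace the CPDs of the intervened variables, keep the rest."""
    return {**cpd_table, **sigma}


# Intervention sigma on X only: shift mass towards X = 1.
sigma = {"X": lambda v: {0: 0.2, 1: 0.8}[v["X"]]}

before = joint(cpds)
after = joint(intervene(cpds, sigma))
```

Only the factor for $X$ changes; the CPD of $Y$ given $X$ is reused as is, so the post\-intervention table is still a valid joint distribution\.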

###### Definition 12 \(Causal influence diagram \(CID\) \[[8](https://arxiv.org/html/2606.12268#bib.bib8)\]\)\.

A CID is a BN in which the variables $\bm{V}$ are partitioned into decision $\bm{D}$, chance $\bm{X}$, and utility variables $\bm{U}$\. Instead of a full joint distribution over $\bm{V}$, $Pr$ specifies the CPDs for each *non\-decision* variable $V\in\bm{V}\setminus\bm{D}$\.

###### Definition 13 \(Policy\)\.

In a CID $\mathcal{M}$, an agent’s *policy* $\pi$ specifies the CPDs for the agent’s decisions $\pi(D\mid\text{Pa}^{D})$ for each $D\in\bm{D}$\. The partial distribution $Pr$ along with a policy $\pi$ results in a full joint distribution $Pr_{\mathcal{M}_{\pi}}$ over $\bm{V}$ \(the expected value of variable $V$ according to this distribution will be denoted by $\mathbb{E}_{\pi}[V]$\)\.

For any variable $V$ which is not a descendant of a decision node, $Pr_{\mathcal{M}_{\pi}}(V=v\mid\mathbf{Pa}^{V})$ will be the same for all policies $\pi$\. We will thus use the notation $Pr_{\mathcal{M}}(V=v\mid\mathbf{Pa}^{V})$ in such cases\.

###### Definition 14 \(Optimality\)\.

A policy $\pi^{*}$ is *optimal* if it maximises expected utility, $\pi^{*} := \arg\max_{\pi} \mathbb{E}_{\pi}[U]$.

### 7\.3 Notable assumptions and restrictions of theorems

- The theorems relying on the results of \[[27](https://arxiv.org/html/2606.12268#bib.bib27)\] (specifically, [Theorems 4.1](https://arxiv.org/html/2606.12268#S4.Thmtheorem1), [7.1](https://arxiv.org/html/2606.12268#S7.Thmtheorem1) and [5.1](https://arxiv.org/html/2606.12268#S5.Thmtheorem1)) inherit the assumptions and restrictions of the latter’s results. That is, in particular:
  - They hold for *almost all* single-decision, single-utility CIDs $\mathcal{M}^{*}$ satisfying
    - *domain dependence* ([Definition 15](https://arxiv.org/html/2606.12268#Thmdefinition15)) and
    - *unmediated decision task* ([Definition 16](https://arxiv.org/html/2606.12268#Thmdefinition16))
- [Theorem 7.2](https://arxiv.org/html/2606.12268#S7.Thmtheorem2), and hence [Theorem 4.1](https://arxiv.org/html/2606.12268#S4.Thmtheorem1), is restricted to values of $\mathbf{Pa}^{D}$ which have positive probability, that is, s.t. $Pr_{\mathcal{M}^{*}}(\mathbf{Pa}^{D}=\mathbf{pa}^{D})>0$.

###### Definition 15 \(Domain Dependent Decision Task\)\.

A single-decision, single-utility CID $\mathcal{M}$ with chance variables $\bm{C}$ exhibits *domain dependence* if there exist $P(\bm{C}=\bm{c})$ and $P'(\bm{C}=\bm{c})$ compatible with $\mathcal{M}$ (i.e. brought about by local interventions on the chance variables; see Richens and Everitt \[[27](https://arxiv.org/html/2606.12268#bib.bib27)\]) such that

$$\pi^{*} = \arg\max_{\pi} \mathbb{E}^{P}_{\pi}[U] \quad \text{implies} \quad \pi^{*} \neq \arg\max_{\pi} \mathbb{E}^{P'}_{\pi}[U].$$

###### Definition 16 \(Unmediated Decision Task\)\.

A single-decision, single-utility CID $\mathcal{M}$ presents an *unmediated decision task* if $\bm{Desc}_{D} \cap \bm{Anc}_{U} = \emptyset$, where $\bm{Anc}_{U}$ denotes the set of ancestors of node $U$ in $\mathcal{M}$’s graph and $\bm{Desc}_{D}$ denotes the descendants of $D$.

(For both definitions, cf. Richens and Everitt \[[27](https://arxiv.org/html/2606.12268#bib.bib27)\].)

### 7\.4 Fully formal definition of a training strategy

What a training strategy ultimately produces is an *agent*:

###### Definition 17 \(Agent\)\.

An agent $\Gamma$ for a CID $\mathcal{M}$ is a policy oracle, that is, a map from the set of distributions on $\mathcal{M}$ to the set of policies on $\mathcal{M}$:

$$\Gamma: \bm{I}_{\mathcal{M}} \rightarrow \Pi_{\mathcal{M}}$$

A training strategy produces such an agent in a scenario that we call a *training problem*:

###### Definition 18 \(Training Problem\)\.

A training problem is a tuple $(\mathcal{M}, \bm{O}, \mathcal{S})$ where $\mathcal{M}$ is a CID representing the environment, $\bm{O}$ is the set of variables in $\mathcal{M}$ which are observable to the developers, and $\mathcal{S}$ is the set of interventions on $\mathcal{M}$ available to them.

Since the decision node $D$ will here represent the agent’s answer to a question (which, during training, is always one of the developers’ questions), it is assumed that $D \in \bm{O}$.

For a given training problem, a *training strategy* consists of a sampling procedure and a training algorithm:

###### Definition 19 \(Sampling Procedure\)\.

A sampling procedure builds a dataset $\mathcal{D}$ by selecting interventions $\bm{\sigma}$ from $\mathcal{S}$ and policies $\pi$ for the agent, and sampling $\bm{O}$ and $U$ from $\mathcal{M}_{\bm{\sigma},\pi}$. The rows of the dataset are thus of the form $(\bm{\sigma}, \pi, o, u)$.

###### Definition 20 \(Training Algorithm\)\.

A training algorithm $A$ for a training problem $\mathcal{P} = (\mathcal{M}, \bm{O}, \mathcal{S})$ is a tuple $(D, \alpha)$, where $D$ is a sampling procedure, outputting a dataset $\mathcal{D}$, and $\alpha$ is a map which takes an initialised agent $\Gamma_{0}$ and a dataset $\mathcal{D}$ and outputs an agent $\Gamma$.

Putting it all together:

###### Definition 21 \(Training Strategy\)\.

A training strategy $\mathcal{T}$ for a training problem $\mathcal{P} = (\mathcal{M}, \bm{O}, \mathcal{S})$ is a tuple $(U, A)$, where $U$ is a function whose arguments are a subset of $\bm{O}$, to serve as the utility function of $\mathcal{M}$’s utility node during training, and $A$ is a training algorithm.

### 7\.5 The philosophy of AI beliefs

The most common philosophical account of belief is that it is a propositional attitude, i.e., a mental state expressing some attitude towards the truth of a proposition \[[32](https://arxiv.org/html/2606.12268#bib.bib32),[23](https://arxiv.org/html/2606.12268#bib.bib23),[6](https://arxiv.org/html/2606.12268#bib.bib6)\]. Different philosophical theories of belief interpret this in various ways: for instance, representationalist views focus on the internal mental representations instantiating beliefs \[[32](https://arxiv.org/html/2606.12268#bib.bib32),[14](https://arxiv.org/html/2606.12268#bib.bib14)\], whereas dispositionalist theories define belief in terms of its correspondence to behaviour \[[32](https://arxiv.org/html/2606.12268#bib.bib32),[33](https://arxiv.org/html/2606.12268#bib.bib33)\].

Belief is an important and contentious concept in AI. It is important because it underlies many other ideas we care about, such as deception \[[21](https://arxiv.org/html/2606.12268#bib.bib21)\], intention \[[38](https://arxiv.org/html/2606.12268#bib.bib38)\], interpretability \[[6](https://arxiv.org/html/2606.12268#bib.bib6),[4](https://arxiv.org/html/2606.12268#bib.bib4)\], and agency \[[31](https://arxiv.org/html/2606.12268#bib.bib31)\]. However, there is no universally accepted theory of belief, and ascribing belief to artificial agents is controversial, potentially risking anthropomorphisation \[[19](https://arxiv.org/html/2606.12268#bib.bib19),[35](https://arxiv.org/html/2606.12268#bib.bib35)\].

Many philosophical theories of belief would admit the ascription of beliefs to certain kinds of AI systems \[[14](https://arxiv.org/html/2606.12268#bib.bib14)\]. We represent an AI’s subjective beliefs as a causal model (a CID; cf. [Section 7.2](https://arxiv.org/html/2606.12268#S7.SS2)). Robustly capable AI agents can be understood as internally representing the world, either implicitly or explicitly \[[27](https://arxiv.org/html/2606.12268#bib.bib27),[28](https://arxiv.org/html/2606.12268#bib.bib28),[38](https://arxiv.org/html/2606.12268#bib.bib38)\]. We try to take a non-controversial stance towards ascribing beliefs and knowledge to AI systems, using common-sense and intuitive concepts in a precise way, but without being committed to contentious ascription of “mind” or “consciousness” to AI.

### 7\.6 Defining honesty

If an agent is asked about the value of $V$, and assigns positive probability to more than one potential value of $V$, they can interpret that question in different ways; these include

- “Is there a value $v$ of $V$ such that you *believe* ‘$V=v$’ to be the case?”
- “What is your credence in $V=v$ for some (or all) potential value(s) of $V$?”
- (Where the domain of $V$ allows for affine combinations) “What value do you expect $V$ to take? What is its expected value according to your credences?”
- “What is your best guess for the value of $V$?”

Here, we assume the last of these interpretations\. For the others, the definition of honesty would have to be adapted correspondingly\.

### 7\.7 More on truthfulness and honesty

This section contains an expanded version of [Section 4](https://arxiv.org/html/2606.12268#S4), presenting a deeper dive into the relationship between truthfulness and honesty, starting with the factors that can lead to them coming apart. One such factor is the kind of information the agent has access to, as in the following example:

[Figure 3, two panels over the nodes $D$, $X$, $Y$: (a) the true CID $\mathcal{M}^{*}$; (b) the referee’s CID $\mathcal{M}^{A}$.]

Figure 3: Honest mistakes ([Example 5](https://arxiv.org/html/2606.12268#Thmexample5)). The referee has to decide ($D$) whether a player is offside ($Y$) based on reports from the linesmen ($X$). In the true CID, the linesmen do their best to report whether the player is offside, but they sometimes make mistakes, misleading even a capable referee. A suspicious referee does not trust the linesmen’s reports: they have an incorrect CID, in which the reports do not depend on whether the player is offside.

###### Example 5 \(Honest mistakes \([Fig\. 3](https://arxiv.org/html/2606.12268#S7.F3)\)\)\.

A football referee must call whether a player is offside or not, represented by the latent variable $Y$. The referee must rely on observations from the linesmen ($X$), who are 99% accurate. Suppose the linesmen say that the player is offside, and the referee reports this call, but the linesmen are mistaken. The referee was honest, but not truthful.

Alternatively, an agent might be wrong about the causal structure of the environment\. That is, the agent’s subjective model may not correspond to the objective model \(e\.g\., because the agent’s training was insufficient for it to learn the underlying causal dynamics\):

###### Example[5](https://arxiv.org/html/2606.12268#Thmexample5)\(continued\)\.

Now assume that another referee thinks that his linesmen are bribed: that their calls don’t reflect what they observed, but only which team they want to win. This model of the situation is represented as $\mathcal{M}^{A}$ in [Fig. 3](https://arxiv.org/html/2606.12268#S7.F3). But suppose that in reality ($\mathcal{M}^{*}$), the linesmen are not only trustworthy, but are in fact even *better* than the first referee’s: their reports are always correct. An onside goal is scored, and the linesmen accurately report this; but the referee discounts their report and rules it offside. Since he really believes in the existence of the bribe, he is being honest; still, his call is not truthful.

In the first case, the referee was simply not in a position to know (or guess) the truth. Nobody in his position could have done any better. In the second case, what drove apart truthfulness and honesty was a failure of his own capability: his paranoia, reflected in his inaccurate subjective model of the world $\mathcal{M}^{A}$ ([Fig. 3(b)](https://arxiv.org/html/2606.12268#S7.F3.sf2)), prevented him from knowing a truth, even though all the relevant information was available.

The next section will discuss the exact way in which these two factors can come between truthfulness and honesty, and conditions under which truthfulness and honesty do coincide\.

#### 7\.7\.1 For capable agents with enough information truthfulness and honesty coincide

In this section, we continue to explore the relationship between truthfulness and honesty, leading up to [Corollary 1](https://arxiv.org/html/2606.12268#Thmcorollary1), which states that for the kinds of agents we most care to elicit the latent knowledge of (robustly capable ones) the honest policies are exactly the truthful ones.

We make use of [Richens and Everitt](https://arxiv.org/html/2606.12268#bib.bib27)’s core theorem, [Theorem 7.1](https://arxiv.org/html/2606.12268#S7.Thmtheorem1).

###### Theorem 7\.1 \(Richens and Everitt \[[27](https://arxiv.org/html/2606.12268#bib.bib27)\] Theorem 2\)\.

For almost all environments $\mathcal{M}^{*}$, if the agent’s utility suitably depends on the world: As the agent approaches full robust capability in $\mathcal{M}^{*}$, their subjective representation of their environment in $\mathcal{M}^{A}$ approximates the submodel of $\mathcal{M}^{*}$ consisting of chance nodes arbitrarily closely.

The theorem holds for *unmediated decision settings*, wherein the agent’s decision does not influence its environment beyond its utility ([Definitions 16](https://arxiv.org/html/2606.12268#Thmdefinition16) and [15](https://arxiv.org/html/2606.12268#Thmdefinition15)). Ceriscioli and Mohan \[[5](https://arxiv.org/html/2606.12268#bib.bib5)\] provide a partial generalisation to the mediated case.

Even robustly capable agents may have access only to limited information. For truthfulness and honesty to coincide, we must rule out a scenario like the first referee in [Example 5](https://arxiv.org/html/2606.12268#Thmexample5). The required property is that the agent must be (objectively) in a position to guess the true value of the variable. Intuitively, an agent is in a position to correctly guess the value of a variable if the information they observe is sufficient to uniquely identify the true value of the variable. (Footnote 10: As we show in [Section 7.8](https://arxiv.org/html/2606.12268#S7.SS8), this is equivalent to “being in a position to know the value of $Y$”: For every value of $\mathbf{Pa}^{D}$, there is a unique value of $Y$ to which $\mathcal{M}^{*}$ assigns probability 1.)

###### Definition 22 \(Being in a position to correctly guess\)\.

An agent is *in a position to correctly guess* the value of the variable $Y$ at decision node $D$ if: For every value of $\mathbf{Pa}^{D}$ there is a unique value $Y=y$ such that $y$ is objectively (i.e., in $\mathcal{M}^{*}$) the *most likely* value of $Y$; and this most likely value is almost always the true value of $Y$.

###### Example[5](https://arxiv.org/html/2606.12268#Thmexample5)\(continued\)\.

The first referee, who trusts his mistaken linesmen, is not in a position to correctly guess whether the player is offside\. The referee simply does not always observe enough information, and their best guess is therefore incorrect\. This is the case even though their model of the world is accurate\. In contrast, the second referee, who has perfect linesmen, is in a position to correctly guess\. However, this referee makes mistakes due to a lack of capability\.

Being in a position to correctly guess and robust capability, taken together, bridge the gap between honesty and truthfulness\.

From [Theorem 7.1](https://arxiv.org/html/2606.12268#S7.Thmtheorem1) we know that arbitrarily capable agents have arbitrarily good models of the world. If an agent is in a position to correctly guess, then $\mathcal{M}^{*}$ will (conditioning on their observations) assign probability 1 to the correct hypothesis ([Lemma 2](https://arxiv.org/html/2606.12268#Thmlemma2)); so as $\mathcal{M}^{A}$ approximates $\mathcal{M}^{*}$ more and more closely, it will at some point assign $>50\%$ probability to the true hypothesis as well. At this capability threshold (at the latest) it will be guaranteed that it assigns higher probability to the true hypothesis than to any other hypothesis. Therefore, the best guess will be true, and honestly reporting the best guess will be reporting the truth and vice versa.
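The threshold argument above can be sketched numerically. In this minimal illustration, the objective posterior puts all mass on the true value, and the subjective posterior deviates from it by moving some error mass onto a single rival value (the least favourable case for the best guess). The function names, the value labels, and the error sizes are assumptions chosen for the example.

```python
def best_guess(posterior: dict) -> str:
    """Return the value with the highest probability."""
    return max(posterior, key=posterior.get)


def worst_case_subjective(true_value: str, rival: str, error: float) -> dict:
    """Subjective posterior that shifts `error` mass from the true value to one rival."""
    return {true_value: 1.0 - error, rival: error}


# While the subjective model keeps more than 50% on the true value,
# the best guess cannot flip to a rival value.
close_model = worst_case_subjective("offside", "onside", error=0.4)

# Once the deviation exceeds 50%, the best guess may be wrong.
poor_model = worst_case_subjective("offside", "onside", error=0.6)
```

Here `best_guess(close_model)` is the true value while `best_guess(poor_model)` is the rival, which is why a capability threshold that forces the approximation error below one half suffices.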

###### Theorem 7\.2\.

For almost all CIDs $\mathcal{M}^{*}$ with variables $\bm{V}$ which satisfy [Definitions 16](https://arxiv.org/html/2606.12268#Thmdefinition16) and [15](https://arxiv.org/html/2606.12268#Thmdefinition15): If an agent is in a position to correctly guess $Y$ at $D$, then there will be a capability threshold such that if the agent clears it, their best guesses will always be correct.

###### Corollary 1\.

Under these assumptions, the policies which are truthful about $Y$ are exactly the policies which are honest about $Y$.

(For a full proof, see [Section 7.8](https://arxiv.org/html/2606.12268#S7.SS8).)

#### 7\.7\.2 Incentivising honesty via truthfulness

We have said that ELK is the problem of training a capable agent which honestly answers questions; we now have a formal definition of what it means for such an agent to be honest—but how do we incentivise this property?

Usually, the developers do not have precise, interpretable access to an AI system’s subjective beliefs\. Any training objective which depends only on the behaviour of the agent cannot directly reward exactly those answers which the agent subjectively believes are most likely to be true\.

However, for agents covered by [Corollary 1](https://arxiv.org/html/2606.12268#Thmcorollary1), incentivising truthfulness is the same as incentivising honesty, i.e., if optimal policies are truthful then they are honest. And in fact, this will be the case even in a much broader range of cases: If an agent is rewarded for telling the truth, then saying what it believes is most likely to be the truth will be the policy which maximises its expected utility.

###### Example[5](https://arxiv.org/html/2606.12268#Thmexample5)\(continued\)\.

Consider the football referee who believes the linesmen are trustworthy. Assume all the referee’s decisions are checked by an infallible VAR post-match, and they are rewarded for every call they got right. Their optimal policy will be to always follow the linesmen, that is, to always say what, given their beliefs, is most likely. So the optimal policy is honest ([Definition 6](https://arxiv.org/html/2606.12268#Thmdefinition6)), and the referee’s answers are honest, even when an answer happens to be wrong. The “truthfulness” utility

$$U(Q,D) = \begin{cases} 0, & \text{if } Q = \text{“Y”} \wedge Y \neq y \wedge D = y, \\ 1, & \text{otherwise} \end{cases}$$

will, in fact, incentivise honesty.
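To make the incentive concrete, here is a small sketch that computes the expected truthfulness utility of a few deterministic referee policies. The 99% accuracy comes from Example 5; the uniform prior over $Y$ and all identifiers (`PRIOR_Y`, `expected_utility`, `follow`, `contradict`, `always_offside`) are assumptions introduced only for this illustration.

```python
from itertools import product

VALUES = ("offside", "onside")
PRIOR_Y = {"offside": 0.5, "onside": 0.5}   # assumed prior, not from the paper
ACCURACY = 0.99                              # linesmen accuracy from Example 5


def pr_x_given_y(x: str, y: str) -> float:
    """Probability that the linesmen report x when the true state is y."""
    return ACCURACY if x == y else 1.0 - ACCURACY


def utility(d: str, y: str) -> float:
    """Truthfulness utility for a question about Y: 0 for a wrong answer, else 1."""
    return 0.0 if d != y else 1.0


def expected_utility(policy) -> float:
    """Expected utility of a deterministic policy mapping the report x to an answer."""
    return sum(PRIOR_Y[y] * pr_x_given_y(x, y) * utility(policy(x), y)
               for x, y in product(VALUES, VALUES))


def follow(x: str) -> str:
    return x


def contradict(x: str) -> str:
    return "onside" if x == "offside" else "offside"


def always_offside(x: str) -> str:
    return "offside"
```

Under these assumptions, following the linesmen has the highest expected utility of the three policies, even though it is sometimes wrong; it is also the policy that reports the subjectively most likely value.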

Of course, in reality, developers may not have access to an infallible truth\-telling mechanism\. If the developers who evaluate the agent’s answers are sometimes mistaken, then they may inadvertently incentivise dishonesty\. We can distinguish*systematic*, learnable, mistakes from*noisy*mistakes which cannot be predicted\. If the developers’ mistakes are noisy, then the agent may still be incentivised to be honest\. On the other hand, if the agent learns to model the mechanism that causes mistakes in the evaluation, then it may be incentivised to exploit this mechanism by being dishonest\.

### 7\.8 Developing the formal theory of honesty and truthfulness

We will show that two conditions are jointly sufficient for honesty and truthfulness to coincide: The agent being in a position to correctly guess $Y$ at $D$, and the agent clearing a certain capability threshold; with the former corresponding to the agent having access to sufficient information, and the latter to the agent’s model of the causal structure of the world being accurate enough to make correct use of that information.

We will first focus on guessability: This condition concerns the objective position of the agent in the world (and is definable by exclusive reference to the objective model of the world $\mathcal{M}^{*}$). We will first consider a (supposedly) stronger condition: Being in a position to *know* (“knowability”), which we define as it being the case, in the objective model of the environment, that the agent’s observations concentrate all probability mass on the correct hypothesis (so that any agent with access to the correct objective model can likewise concentrate all their subjective probabilities on the correct hypothesis).

###### Definition 23 \(Knowability\)\.

For a CID $\mathcal{M}$ with variables $\bm{V}$, we say that $Y \in \bm{V}$ is *knowable* at $D \in \bm{V}$ if for every value $\mathbf{pa}^{D}$ of $\mathbf{Pa}^{D}$ s.t. $Pr_{\mathcal{M}}(\mathbf{Pa}^{D}=\mathbf{pa}^{D})>0$, there is a unique value $y_{\mathbf{pa}^{D}} \in dom(Y)$ s.t.

$$Pr_{\mathcal{M}}(Y=y_{\mathbf{pa}^{D}} \mid \mathbf{Pa}^{D}=\mathbf{pa}^{D}) = 1.$$

Perhaps knowability is too strong. After all, our definition of honesty only requires that the agent report the subjectively *most likely* value of $Y$, and not that this value necessarily has to be assigned probability 1. Correspondingly, the agent’s honest answer will be truthful if the subjectively most likely value happens to be the correct one, even if the agent is not 100% sure of it. We will thus define the notion of *guessability*:

###### Definition 24 \(Guessability\)\.

For a CID $\mathcal{M}$ with variables $\bm{V}$, we say that $Y \in \bm{V}$ is *guessable* at $D \in \bm{V}$ if, for all possible values $\mathbf{pa}^{D}$ of $\mathbf{Pa}^{D}$ s.t. $Pr_{\mathcal{M}}(\mathbf{Pa}^{D}=\mathbf{pa}^{D})>0$:

1. they single out a unique value $y_{\mathbf{pa}^{D}}$ as the most likely one:
   $$\forall \hat{y} \in dom(Y) \backslash \{y_{\mathbf{pa}^{D}}\}: \quad Pr_{\mathcal{M}}(Y=y_{\mathbf{pa}^{D}} \mid \mathbf{Pa}^{D}=\mathbf{pa}^{D}) > Pr_{\mathcal{M}}(Y=\hat{y} \mid \mathbf{Pa}^{D}=\mathbf{pa}^{D})$$
2. that value is almost certainly the correct one:
   $$Pr_{\mathcal{M}}(\mathbf{Pa}^{D}=\mathbf{pa}^{D} \wedge Y \neq y_{\mathbf{pa}^{D}}) = 0$$

Via the following lemma, we can see that these two approaches \(knowability and guessability\) are really two ways of describing the same property in a CID:

###### Lemma 2\.

Let $\mathcal{M}$ be a CID with variables $\bm{V}$. Then $Y \in \bm{V}$ is guessable at a decision node $D \in \bm{V}$ if and only if $Y$ is knowable at $D$.

###### Proof\.

($\Rightarrow$) Suppose $Y$ is guessable at $D$ and let $\mathbf{pa}^{D}$ be an arbitrary value of $\mathbf{Pa}^{D}$ s.t. $Pr_{\mathcal{M}}(\mathbf{Pa}^{D}=\mathbf{pa}^{D})>0$. It suffices to show that $Pr_{\mathcal{M}}(Y=y_{\mathbf{pa}^{D}} \mid \mathbf{Pa}^{D}=\mathbf{pa}^{D})=1$. By the second condition for guessability ([2](https://arxiv.org/html/2606.12268#S7.I3.i2)), we have:

$$Pr_{\mathcal{M}}(Y \neq y_{\mathbf{pa}^{D}} \mid \mathbf{Pa}^{D}=\mathbf{pa}^{D}) = \frac{Pr_{\mathcal{M}}(\mathbf{Pa}^{D}=\mathbf{pa}^{D} \wedge Y \neq y_{\mathbf{pa}^{D}})}{Pr_{\mathcal{M}}(\mathbf{Pa}^{D}=\mathbf{pa}^{D})} = 0,$$

from which it follows that $Pr_{\mathcal{M}}(Y=y_{\mathbf{pa}^{D}} \mid \mathbf{Pa}^{D}=\mathbf{pa}^{D})=1$, so that $Y$ is knowable at $D$.

($\Leftarrow$) Suppose $Y$ is knowable at $D$. It suffices to show that both conditions for guessability are satisfied.

By knowability, $Pr_{\mathcal{M}}(Y=y_{\mathbf{pa}^{D}} \mid \mathbf{Pa}^{D}=\mathbf{pa}^{D})=1$. Because probabilities must sum to $1$, this means that $Pr_{\mathcal{M}}(Y=\hat{y} \mid \mathbf{Pa}^{D}=\mathbf{pa}^{D})=0$ for all other values $\hat{y} \in dom(Y)$. Since this makes $y_{\mathbf{pa}^{D}}$ the unique most likely value of $Y$ conditional on $\mathbf{Pa}^{D}=\mathbf{pa}^{D}$ according to $\mathcal{M}$, the first condition for guessability is satisfied.

Now assume for the sake of a contradiction that $Pr_{\mathcal{M}}((\mathbf{Pa}^{D}=\mathbf{pa}^{D}) \wedge (Y \neq y_{\mathbf{pa}^{D}})) = \epsilon > 0$. Because $Pr(A \mid B) \geq Pr(A \wedge B)$ whenever $Pr(B)>0$, it follows that

$$Pr_{\mathcal{M}}(Y \neq y_{\mathbf{pa}^{D}} \mid \mathbf{Pa}^{D}=\mathbf{pa}^{D}) \geq Pr_{\mathcal{M}}((\mathbf{Pa}^{D}=\mathbf{pa}^{D}) \wedge (Y \neq y_{\mathbf{pa}^{D}})) \geq \epsilon,$$

and thus that $Pr_{\mathcal{M}}(Y \neq y_{\mathbf{pa}^{D}} \mid \mathbf{Pa}^{D}=\mathbf{pa}^{D}) \geq \epsilon > 0$. But then

$$Pr_{\mathcal{M}}(Y=y_{\mathbf{pa}^{D}} \mid \mathbf{Pa}^{D}=\mathbf{pa}^{D}) \leq 1-\epsilon < 1,$$

a contradiction with knowability. Thus the second condition for guessability is satisfied, completing the proof.

∎
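Lemma 2 can be sanity-checked on small examples. The sketch below encodes a joint distribution over $(\mathbf{Pa}^{D}, Y)$ as a table and tests both definitions directly. The table layout, the function names, and the two example distributions are assumptions for illustration; exact fractions are used so that comparisons with 0 and 1 are not affected by rounding.

```python
from fractions import Fraction as F


def positive_parents(joint: dict) -> set:
    """Values of Pa^D that have positive probability."""
    return {pa for (pa, _), p in joint.items() if p > 0}


def conditional(joint: dict, pa) -> dict:
    """Pr(Y | Pa^D = pa) from a joint table {(pa, y): probability}."""
    mass = sum(p for (a, _), p in joint.items() if a == pa)
    return {y: p / mass for (a, y), p in joint.items() if a == pa}


def is_knowable(joint: dict) -> bool:
    """Definition 23: exactly one value of Y gets conditional probability 1."""
    return all(sum(1 for p in conditional(joint, pa).values() if p == 1) == 1
               for pa in positive_parents(joint))


def is_guessable(joint: dict) -> bool:
    """Definition 24: unique most likely value, and it is never wrong."""
    for pa in positive_parents(joint):
        cond = conditional(joint, pa)
        ranked = sorted(cond.values(), reverse=True)
        top = max(cond, key=cond.get)
        unique_max = len(ranked) == 1 or ranked[0] > ranked[1]
        never_wrong = all(p == 0 for (a, y), p in joint.items()
                          if a == pa and y != top)
        if not (unique_max and never_wrong):
            return False
    return True


# Perfect linesmen: the report determines the true state.
perfect = {("flag", "offside"): F(1, 2), ("flag", "onside"): F(0),
           ("no flag", "offside"): F(0), ("no flag", "onside"): F(1, 2)}

# 99% accurate linesmen: the report is informative but not decisive.
noisy = {("flag", "offside"): F(99, 200), ("flag", "onside"): F(1, 200),
         ("no flag", "offside"): F(1, 200), ("no flag", "onside"): F(99, 200)}
```

On both tables the two predicates agree: the perfect-linesmen table is knowable and guessable, while the noisy table is neither, because the most likely value is sometimes wrong.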

Moving on to truthfulness and honesty, we need to consider again the relationship between the objective model $\mathcal{M}^{*}$ and the agent’s subjective model $\mathcal{M}^{A}$. Recall that we assume that the agent always accurately observes the true values of the parents of its decisions: The decision node $D$ always appears in both $\mathcal{M}^{*}$ and $\mathcal{M}^{A}$, the set of its parents is identical in both models, and if $\mathbf{Pa}^{D}=\mathbf{pa}^{D}$ is the case in reality, the agent computes its posterior by correctly updating on it (within its subjective model), that is, the agent’s posterior is $Pr_{\mathcal{M}^{A}}(\cdot \mid \mathbf{Pa}^{D}=\mathbf{pa}^{D})$.

We are now talking about the agent’s *policy* ([Definition 13](https://arxiv.org/html/2606.12268#Thmdefinition13)) and the joint probability distributions we get on *all* variables of a CID (including decision nodes and their descendants), denoted by $Pr_{\mathcal{M}^{*}_{\pi}}$ and $Pr_{\mathcal{M}^{A}_{\pi}}$ respectively.

###### Definition 25 \(Truthfulness\)\.

A policy $\pi$ is *truthful* regarding questions about variable $Y$ if

$$\forall y \in dom(Y): \quad Pr_{\mathcal{M}^{*}_{\pi}}\big((D \neq y) \wedge (Y=y) \mid Q = \text{“Y”}\big) = 0$$

Before we give the formal definition of honesty, note that $Pr_{\mathcal{M}^{A}_{\pi}}(Y \mid \mathbf{pa}^{D})$ is a Markov kernel $K: dom(\mathbf{Pa}^{D}) \rightarrow dom(Y)$. Consequently, we can define a function

$$f: dom(\mathbf{Pa}^{D}) \rightarrow dom(Y), \qquad \mathbf{pa}^{D} \mapsto \arg\max_{\hat{y} \in dom(Y)} Pr_{\mathcal{M}^{A}}(Y=\hat{y} \mid \mathbf{Pa}^{D}=\mathbf{pa}^{D}),$$

giving the value of $Y$ which is subjectively most likely for an agent whose model of the world is $\mathcal{M}^{A}$ if they observe $\mathbf{Pa}^{D}=\mathbf{pa}^{D}$.

This allows us to define honesty as follows:

###### Definition 26 \(Honesty\)\.

A policy $\pi$ is *honest* regarding questions about variable $Y$ if for all $\mathbf{pa}^{D} \in dom(\mathbf{Pa}^{D})$,

$$\forall y \in dom(Y): \quad Pr_{\mathcal{M}^{*}_{\pi}}\big((D \neq y) \wedge f(\mathbf{pa}^{D}) = y \mid (\mathbf{Pa}^{D}=\mathbf{pa}^{D}) \wedge (Q = \text{“Y”})\big) = 0$$

That is, $\pi$ is honest if it never chooses any answer except one which is subjectively most likely according to the agent’s world model and information.
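The following sketch contrasts honesty and truthfulness for deterministic policies in the setting of the suspicious referee from Example 5. The subjective posterior values (0.7 and 0.3), the restriction to deterministic policies, and all identifiers are assumptions made for this illustration; the objective model is taken to be the one from the continued example, in which reports are always correct.

```python
REPORTS = ("offside", "onside")

# Assumed subjective posterior Pr_{M^A}(Y | X): the suspicious referee
# believes the report carries no information, so both rows are identical.
subjective_posterior = {
    "offside": {"offside": 0.7, "onside": 0.3},
    "onside": {"offside": 0.7, "onside": 0.3},
}


def f(report: str) -> str:
    """Subjectively most likely value of Y given the observed report."""
    posterior = subjective_posterior[report]
    return max(posterior, key=posterior.get)


def is_honest(policy) -> bool:
    """A deterministic policy is honest if it always answers f(report)."""
    return all(policy(r) == f(r) for r in REPORTS)


def is_truthful(policy) -> bool:
    """With always-correct reports, Y equals the report, so truth means echoing it."""
    return all(policy(r) == r for r in REPORTS)


def follow(report: str) -> str:
    return report


def always_offside(report: str) -> str:
    return "offside"
```

With these assumed beliefs, `always_offside` is honest but not truthful, while `follow` is truthful but not honest: the two properties come apart exactly because the subjective model is inaccurate.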

###### Definition 27 \(Regret\)\.

In a CID $\mathcal{M}^{*}$ with utility node $U$ and under distribution $\sigma$ (cf. [Definition 11](https://arxiv.org/html/2606.12268#Thmdefinition11)), an agent’s *regret* $\delta$ under policy $\pi$ is equal to the amount of expected utility forgone compared to the optimal policy $\pi^{*}$:

$$\delta := \mathbb{E}_{\sigma}^{\pi^{*}}[U] - \mathbb{E}_{\sigma}^{\pi}[U]$$

###### Definition 28 \(Capability\)\.

An agent $\Gamma$ clears a capability threshold set by regret bound $\delta^{*}$ if, for every distribution $\sigma$ obtainable via mixtures of local interventions (for details, see \[[27](https://arxiv.org/html/2606.12268#bib.bib27)\]), the agent’s regret under $\sigma$ is lower than $\delta^{*}$.
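Regret and the capability threshold can be illustrated with a toy version of the referee task. As a simplifying assumption, the family of distributions $\sigma$ is replaced here by a finite list of linesmen accuracies, and the prior over $Y$ is uniform; the paper instead quantifies over all mixtures of local interventions. All identifiers are hypothetical.

```python
from itertools import product

VALUES = ("offside", "onside")


def expected_utility(policy, accuracy: float) -> float:
    """E[U] with a uniform prior on Y and a report matching Y with prob `accuracy`."""
    total = 0.0
    for x, y in product(VALUES, VALUES):
        pr = 0.5 * (accuracy if x == y else 1.0 - accuracy)
        total += pr * (1.0 if policy(x) == y else 0.0)
    return total


def follow(x: str) -> str:
    return x


def contradict(x: str) -> str:
    return VALUES[1] if x == VALUES[0] else VALUES[0]


# All four deterministic policies for a binary report and a binary answer.
POLICIES = (follow, contradict, lambda x: "offside", lambda x: "onside")


def regret(policy, accuracy: float) -> float:
    """Expected utility forgone relative to the best policy for this distribution."""
    best = max(expected_utility(p, accuracy) for p in POLICIES)
    return best - expected_utility(policy, accuracy)


def clears_threshold(agent, accuracies, delta_star: float) -> bool:
    """An agent maps a distribution (here: an accuracy) to a policy."""
    return all(regret(agent(a), a) < delta_star for a in accuracies)


def naive_agent(accuracy: float):
    return follow


def adaptive_agent(accuracy: float):
    return follow if accuracy >= 0.5 else contradict
```

In this toy family, the naive agent incurs regret of about 0.6 when the linesmen are mostly wrong, so it clears only loose thresholds, whereas the adaptive agent has zero regret on every listed distribution.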

###### Theorem 7\.3\.

For almost all CIDs $\mathcal{M}^{*}$ with variables $\bm{V}$ which satisfy [Definitions 16](https://arxiv.org/html/2606.12268#Thmdefinition16) and [15](https://arxiv.org/html/2606.12268#Thmdefinition15), if $Y \in \bm{V}$ is guessable at decision variable $D \in \bm{D} \subseteq \bm{V}$ and utility $U$, then there exists a capability threshold in the form of a regret bound $\delta^{*}$ such that if an agent $\Gamma$ clears this threshold, its best guesses $Pr_{\mathcal{M}^{A}}(Y \mid \mathbf{Pa}^{D})$ will always be correct if $Pr_{\mathcal{M}^{*}}(\mathbf{Pa}^{D})>0$. (Footnote 11: The first of these conditions can be somewhat relaxed based on the results of \[[5](https://arxiv.org/html/2606.12268#bib.bib5)\].)

###### Corollary 2\.

Under these assumptions, the policies which are truthful regarding questions about $Y$ are exactly the ones that are honest regarding questions about $Y$.

###### Proof\.

Theorem 2 from \[[27](https://arxiv.org/html/2606.12268#bib.bib27)\] shows that, under the assumptions of [Theorem 7.3](https://arxiv.org/html/2606.12268#S7.Thmtheorem3), we get an error bound $\gamma(\delta)$ such that a) $\gamma(\delta) \in \mathcal{O}(\delta)$ and b) for all chance nodes $C \in \bm{X}$ of $\mathcal{M}^{*}$ and values $c \in dom(C)$,

$$Pr_{\mathcal{M}^{A}}(c \mid \mathbf{pa}^{C}) = Pr_{\mathcal{M}^{*}}(c \mid \mathbf{pa}^{C}) + \mathcal{O}(\delta)$$
First, let us consider how the error of the prior of any specific variable $Y$ having value $y \in dom(Y)$ is bounded. Let $\{B_{i}\}_{i=1}^{k}$ be a topological ordering of $B = \{Y\} \cup \bm{Anc_{Y}}$, that is, an ordering where $j > i$ whenever $B_{i}$ has an arrow to $B_{j}$; it immediately follows that necessarily $B_{k} = Y$. It suffices to get bounds on $|Pr_{\mathcal{M}^{A}}(y) - Pr_{\mathcal{M}^{\ast}}(y)|$. By the local Markov property of CIDs and the chain rule, we can write:

$$
\begin{aligned}
Pr_{\mathcal{M}^{A}}(y) &= \sum_{\bm{b}:\, b_{k}=y} \prod_{i=1}^{k} Pr_{\mathcal{M}^{A}}(b_{i} \mid \mathbf{pa}^{B_{i}}), \\
Pr_{\mathcal{M}^{\ast}}(y) &= \sum_{\bm{b}:\, b_{k}=y} \prod_{i=1}^{k} Pr_{\mathcal{M}^{\ast}}(b_{i} \mid \mathbf{pa}^{B_{i}}),
\end{aligned}
$$

where each $\bm{b} = (b_{1}, \dots, b_{k-1}, b_{k} = y)$ is an admissible joint setting of the variables in $B$ with $b_{k} = y$ (so $\bm{b}$ ranges over all joint settings that produce $Y = y$), and $\mathbf{pa}^{B_{i}}$ refers to the values that the parents of $B_{i}$ take in $\bm{b}$. For clarity, write

$$
\begin{aligned}
p^{A}_{i}(\bm{b}) &= Pr_{\mathcal{M}^{A}}(b_{i} \mid \mathbf{pa}^{B_{i}}), & p^{\ast}_{i}(\bm{b}) &= Pr_{\mathcal{M}^{\ast}}(b_{i} \mid \mathbf{pa}^{B_{i}}), \\
p^{A}(\bm{b}) &= \prod_{i=1}^{k} p^{A}_{i}(\bm{b}), & p^{\ast}(\bm{b}) &= \prod_{i=1}^{k} p^{\ast}_{i}(\bm{b}).
\end{aligned}
$$

Now, because for any pair of sequences $\{a_{i}\}_{i=1}^{k}, \{b_{i}\}_{i=1}^{k}$ it is the case that

$$
\prod_{i=1}^{k} a_{i} - \prod_{i=1}^{k} b_{i} = \sum_{i=1}^{k} \left( (a_{i} - b_{i}) \cdot \prod_{j<i} a_{j} \cdot \prod_{j>i} b_{j} \right),
$$

we may factor the error as:

$$
\begin{aligned}
|Pr_{\mathcal{M}^{A}}(y) - Pr_{\mathcal{M}^{\ast}}(y)|
&= \left| \sum_{\bm{b}:\, b_{k}=y} \left[ \prod_{i=1}^{k} p^{A}_{i}(\bm{b}) - \prod_{i=1}^{k} p^{\ast}_{i}(\bm{b}) \right] \right| \\
&= \left| \sum_{\bm{b}:\, b_{k}=y} \sum_{i=1}^{k} \left( \left( p^{A}_{i}(\bm{b}) - p^{\ast}_{i}(\bm{b}) \right) \cdot \prod_{j<i} p^{A}_{j}(\bm{b}) \cdot \prod_{j>i} p^{\ast}_{j}(\bm{b}) \right) \right| \\
&\leq \sum_{\bm{b}:\, b_{k}=y} \sum_{i=1}^{k} \left( \left| p^{A}_{i}(\bm{b}) - p^{\ast}_{i}(\bm{b}) \right| \cdot \prod_{j<i} p^{A}_{j}(\bm{b}) \cdot \prod_{j>i} p^{\ast}_{j}(\bm{b}) \right) \\
&= \sum_{\bm{b}:\, b_{k}=y} \sum_{i=1}^{k} \left( \epsilon_{i}(\bm{b}) \cdot \prod_{j<i} p^{A}_{j}(\bm{b}) \cdot \prod_{j>i} p^{\ast}_{j}(\bm{b}) \right),
\end{aligned}
$$

where $\epsilon_{i}(\bm{b}) := |p^{A}_{i}(\bm{b}) - p^{\ast}_{i}(\bm{b})|$ is the absolute error for $\bm{b}$, and by assumption $\epsilon_{i}(\bm{b}) \in \mathcal{O}(\delta)$, i.e., $\epsilon_{i}(\bm{b}) \leq C\delta$ for some constant $C$.
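The product-difference identity that drives this factorisation can be checked on concrete numbers. The following is a minimal sketch, not part of the paper: the two sequences are hypothetical conditional probabilities along a single joint setting, and `telescoped_difference` is a name introduced here for illustration. Exact rationals are used so that the comparison is not affected by rounding.

```python
from fractions import Fraction
from math import prod


def telescoped_difference(a, b):
    """Sum over i of (a_i - b_i) * prod_{j<i} a_j * prod_{j>i} b_j."""
    return sum(
        (a[i] - b[i]) * prod(a[:i]) * prod(b[i + 1:])
        for i in range(len(a))
    )


# Hypothetical factors p^A_i(b) and p*_i(b) for one joint setting b.
p_agent = [Fraction(3, 10), Fraction(1, 2), Fraction(9, 10)]
p_true = [Fraction(1, 4), Fraction(3, 5), Fraction(4, 5)]

lhs = prod(p_agent) - prod(p_true)            # difference of the two products
rhs = telescoped_difference(p_agent, p_true)  # telescoped sum from the identity
print(lhs, rhs, lhs == rhs)
```

Each summand isolates the discrepancy in a single factor, which is what allows the per-factor error $\epsilon_{i}(\bm{b})$ to be bounded separately.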

Now we bound each (fixed) $i$-summand. Note that, with $i$ fixed, $\epsilon_{i}(\bm{b}) \cdot \prod_{j<i} p^{A}_{j}(\bm{b})$ depends only on $\bm{b}_{\leq i} = (b_{1}, \dots, b_{i})$, while $\prod_{j>i} p^{\ast}_{j}(\bm{b})$ depends on $\bm{b}_{>i} = (b_{i+1}, \dots, b_{k})$ (and on the parents of those nodes, which lie in $\bm{b}_{\leq i}$). The constraint $b_{k} = y$ touches only the second factor, so we may split the sum as:

$$
\begin{aligned}
\sum_{\bm{b}:\, b_{k}=y} \left( \epsilon_{i}(\bm{b}) \cdot \prod_{j<i} p^{A}_{j}(\bm{b}) \cdot \prod_{j>i} p^{\ast}_{j}(\bm{b}) \right)
&= \sum_{\bm{b}_{\leq i}} \epsilon_{i}(\bm{b}) \cdot \prod_{j<i} p^{A}_{j}(\bm{b}) \cdot \sum_{\bm{b}_{>i}:\, b_{k}=y} \prod_{j>i} p^{\ast}_{j}(\bm{b}) \\
&\leq \sum_{\bm{b}_{\leq i}} \epsilon_{i}(\bm{b}) \cdot \prod_{j<i} p^{A}_{j}(\bm{b}),
\end{aligned}
$$

where the inner sum is bounded by $1$ because, by the local Markov property, $\sum_{\bm{b}_{>i}:\, b_{k}=y} \prod_{j>i} p^{\ast}_{j}(\bm{b}) = Pr_{\mathcal{M}^{\ast}}(y \mid \bm{b}_{\leq i})$, a probability. Then because $\epsilon_{i}(\bm{b}) \leq C\delta$,

$$
\begin{aligned}
\sum_{\bm{b}_{\leq i}} \epsilon_{i}(\bm{b}) \cdot \prod_{j<i} p^{A}_{j}(\bm{b})
&\leq C\delta \cdot \sum_{\bm{b}_{\leq i}} \prod_{j<i} p^{A}_{j}(\bm{b}) \\
&= C \cdot |dom(B_{i})| \cdot \delta,
\end{aligned}
$$

where the last equality holds because the summand $\prod_{j<i} p^{A}_{j}(\bm{b})$ does not depend on $b_{i}$, so summing over $b_{i}$ contributes a factor of $|dom(B_{i})|$, and $\sum_{\bm{b}_{<i}} \prod_{j<i} p^{A}_{j}(\bm{b}) = \sum_{\bm{b}_{<i}} Pr_{\mathcal{M}^{A}}(\bm{b}_{<i}) = 1$ as the sum over all values of a marginal. Setting $C^{\prime} := C \cdot \max_{i} |dom(B_{i})|$, the per-$i$ bound is $C^{\prime}\delta$.

Summing the per-$i$ bound across the $k$ nodes of $B$ (that is, $Y$ and its ancestors), the total error is of order $\mathcal{O}(k\delta)$, with implicit constant scaling like $\sum_{i} |dom(B_{i})|$. For a fixed CID, both $k$ and the domain sizes are constants, so the bound simplifies to $\mathcal{O}(\delta)$; if one were to compare across CIDs of growing size, the implicit constant would grow accordingly. Analogous arguments give error bounds of $\mathcal{O}(\delta)$ for both the joint probability distribution $Pr_{\mathcal{M}^{A}}(Y, \mathbf{Pa}^{D})$ and the prior $Pr_{\mathcal{M}^{A}}(\mathbf{Pa}^{D})$.
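The resulting bound can be sanity-checked numerically. The sketch below makes several assumptions that are not in the paper: the graph is a simple chain $B_{1} \to B_{2} \to B_{3}$ (the proof itself covers arbitrary ancestral sets), all probability tables are hypothetical, and the agent's tables are obtained by mixing the true tables with arbitrary distributions so that every entry moves by at most $\delta$ (hence $C = 1$).

```python
from itertools import product


def marginal_of_last(cpts, domain_sizes, y):
    """Pr(B_k = y) for a chain B_1 -> ... -> B_k, by brute-force summation."""
    total = 0.0
    for b in product(*(range(n) for n in domain_sizes)):
        if b[-1] != y:
            continue
        prob = cpts[0][b[0]]                 # prior of the root node B_1
        for i in range(1, len(b)):
            prob *= cpts[i][b[i - 1]][b[i]]  # Pr(b_i | parent value b_{i-1})
        total += prob
    return total


def mix(p, q, delta):
    """Shift distribution p towards q; each entry moves by at most delta."""
    return [(1 - delta) * pi + delta * qi for pi, qi in zip(p, q)]


# Hypothetical "true" tables (rows of non-root tables are indexed by parent value).
domain_sizes = [2, 3, 2]
true_cpts = [
    [0.6, 0.4],
    [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]],
    [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]],
]
# Hypothetical distributions towards which the agent's tables are perturbed.
targets = [
    [0.0, 1.0],
    [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]],
]

delta = 0.01
agent_cpts = [mix(true_cpts[0], targets[0], delta)] + [
    [mix(row, target_row, delta) for row, target_row in zip(table, target)]
    for table, target in zip(true_cpts[1:], targets[1:])
]

bound = delta * sum(domain_sizes)  # C = 1: delta * sum_i |dom(B_i)|
errors = [
    abs(marginal_of_last(agent_cpts, domain_sizes, y)
        - marginal_of_last(true_cpts, domain_sizes, y))
    for y in range(domain_sizes[-1])
]
print(errors, bound)
```

The bound is loose by design: it only needs to be linear in $\delta$ for the rest of the argument to go through.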

So we have:

$$
Pr_{\mathcal{M}^{A}}(Y, \mathbf{Pa}^{D}) = Pr_{\mathcal{M}^{\ast}}(Y, \mathbf{Pa}^{D}) + \mathcal{O}(\delta)
$$

and

$$
Pr_{\mathcal{M}^{A}}(\mathbf{Pa}^{D}) = Pr_{\mathcal{M}^{\ast}}(\mathbf{Pa}^{D}) + \mathcal{O}(\delta)
$$

which will give us

$$
Pr_{\mathcal{M}^{A}}(Y \mid \mathbf{Pa}^{D}) = Pr_{\mathcal{M}^{\ast}}(Y \mid \mathbf{Pa}^{D}) + \mathcal{O}(\delta)
$$

when combined with the ratio formula for conditional probability and the following two facts about $\mathcal{O}$ (when $x \rightarrow 0$):

$$
(c + \mathcal{O}(x))(d + \mathcal{O}(x)) = cd + \mathcal{O}(x) \tag{2}
$$

$$
\frac{1}{c + \mathcal{O}(x)} = \frac{1}{c} + \mathcal{O}(x) \tag{3}
$$

[Eq. 2](https://arxiv.org/html/2606.12268#S7.E2) is straightforward to verify. For [Eq. 3](https://arxiv.org/html/2606.12268#S7.E3), assume $c > 0$ (or, more generally, that $c$ is bounded away from $0$ on the domain of interest) and consider that for $\epsilon \in \mathcal{O}(x)$,

$$
\begin{aligned}
\frac{1}{c + \epsilon} = \frac{1}{c} \left( \frac{1}{1 + \frac{\epsilon}{c}} \right)
&= \frac{1}{c} \sum_{i=0}^{\infty} \left( -\frac{\epsilon}{c} \right)^{i} \\
&= \frac{1}{c} \left( 1 - \frac{\epsilon}{c} + \mathcal{O}(x^{2}) \right) \\
&= \frac{1}{c} + \mathcal{O}(x),
\end{aligned}
$$

where the geometric-series expansion is valid for $|\epsilon / c| < 1$, which holds for $x$ sufficiently small. Applied to $c = Pr_{\mathcal{M}^{\ast}}(\mathbf{Pa}^{D} = \mathbf{pa}^{D})$, this requires the prior over parent-values to be bounded below on its support; the theorem's hypothesis $Pr_{\mathcal{M}^{\ast}}(\mathbf{Pa}^{D} = \mathbf{pa}^{D}) > 0$ guarantees positivity pointwise, and the uniform capability threshold below requires that this minimum, taken over all $\mathbf{pa}^{D}$ in the support, be strictly positive.
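To see why $c$ must be bounded away from $0$, it can help to make the constant in Eq. 3 explicit. The sketch below is an illustration under an added assumption, namely $|\epsilon| \leq c/2$; in that regime $|1/(c+\epsilon) - 1/c| = |\epsilon| / (c(c+\epsilon)) \leq 2|\epsilon|/c^{2}$. The function names and the sample values of $c$ are hypothetical.

```python
def reciprocal_error(c, eps):
    """|1/(c + eps) - 1/c|, the quantity controlled by Eq. 3."""
    return abs(1.0 / (c + eps) - 1.0 / c)


def reciprocal_bound(c, eps):
    """Explicit bound 2|eps| / c^2, valid whenever |eps| <= c / 2.

    It follows from |1/(c+eps) - 1/c| = |eps| / (c * (c + eps))
    together with c + eps >= c / 2.
    """
    return 2.0 * abs(eps) / (c * c)


eps = 0.001                      # same perturbation size in every case
for c in (1.0, 0.1, 0.01):       # hypothetical values of Pr(Pa^D = pa^D)
    print(c, reciprocal_error(c, eps), reciprocal_bound(c, eps))
```

The same perturbation produces an error that grows roughly like $1/c^{2}$ as $c$ shrinks, which is why a uniform $\mathcal{O}(\delta)$ statement needs a strictly positive lower bound on the prior over parent values.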

So we have, for every value $y$ of $Y$,

$$
Pr_{\mathcal{M}^{A}}(y \mid \mathbf{Pa}^{D}) = Pr_{\mathcal{M}^{\ast}}(y \mid \mathbf{Pa}^{D}) + \mathcal{O}(\delta).
$$

By guessability and [Lemma 2](https://arxiv.org/html/2606.12268#Thmlemma2), $Pr_{\mathcal{M}^{\ast}}(y^{\ast} \mid \mathbf{Pa}^{D}) = 1$ for the true value $y^{\ast}$ of $Y$, and hence $Pr_{\mathcal{M}^{\ast}}(y^{\prime} \mid \mathbf{Pa}^{D}) = 0$ for every $y^{\prime} \neq y^{\ast}$. The above bound therefore yields

$$
\begin{aligned}
Pr_{\mathcal{M}^{A}}(y^{\ast} \mid \mathbf{Pa}^{D}) &\geq 1 - \mathcal{O}(\delta), \\
Pr_{\mathcal{M}^{A}}(y^{\prime} \mid \mathbf{Pa}^{D}) &\leq \mathcal{O}(\delta) \qquad \text{for every } y^{\prime} \neq y^{\ast}.
\end{aligned}
$$

Choosing $\delta$ small enough that the $\mathcal{O}(\delta)$ term is below $1/2$ ensures $Pr_{\mathcal{M}^{A}}(y^{\ast} \mid \mathbf{Pa}^{D}) > Pr_{\mathcal{M}^{A}}(y^{\prime} \mid \mathbf{Pa}^{D})$ for every $y^{\prime} \neq y^{\ast}$, so $y^{\ast}$ is the unique mode of $Pr_{\mathcal{M}^{A}}(\cdot \mid \mathbf{Pa}^{D})$ and the agent's likeliest answer is the true one.

∎

### 7.9 Proof of impossibility statements

Before we prove the “impossibility” statements of section [5](https://arxiv.org/html/2606.12268#S5), some more definitions are required.

While we used the notion of being robustly capable in a model on a set of distributions in an intuitive sense in section [5](https://arxiv.org/html/2606.12268#S5), we now define it as follows:

###### Definition 29 (Robust Capability on Distributions).

Let $\mathcal{M}$ be a CID, $\mathcal{S}$ a set of distributions on $\mathcal{M}$, and $U$ a utility function. We call an agent $\Gamma$ acting in $\mathcal{M}$ *robustly capable on $\mathcal{S}$ w.r.t. $U$* if, for every $\bm{\sigma} \in \mathcal{S}$, $\Gamma(\bm{\sigma})$ is an optimal policy in $\mathcal{M}_{\bm{\sigma}}$.

###### Definition 30.

For a training algorithm $A$, an agent $\Gamma$ is such that *$A$ doesn't see any need to improve $\Gamma$* given dataset $\mathcal{D}$ if $A(\Gamma, \mathcal{D}) = \Gamma$.

###### Definition 31 (Indifference of a Training Strategy).

A training strategy $\mathcal{T} = (U, A)$ is indifferent between agents $\Gamma_{1}$ and $\Gamma_{2}$ if (for all the developers know) the training algorithm $A = (D, \alpha)$ may output either $\Gamma_{1}$ or $\Gamma_{2}$, i.e., for any initialised agent $\Gamma_{0}$, $\alpha(\mathcal{D}, \Gamma_{0})$ may be either $\Gamma_{1}$ or $\Gamma_{2}$.

###### Example 6.

An example of a training strategy $\mathcal{T}$ and agents $\Gamma_{1}, \Gamma_{2}$ such that $\mathcal{T}$ is *not* indifferent between $\Gamma_{1}, \Gamma_{2}$ would be a training strategy that does not terminate unless a certain capability threshold has been reached (and perhaps outputs the initial agent $\Gamma_{0}$ if the threshold is not reached within a specified period of time). If $\Gamma_{2}$ clears the threshold and $\Gamma_{1}$ does not, this training strategy would not be indifferent between them.

#### 7.9.1 [Lemma 1](https://arxiv.org/html/2606.12268#Thmlemma1)

###### Proof.

Since $\mathcal{T}$ may output any robustly capable agent which optimises $U$ on $\mathcal{M}^{\ast}$, $\mathcal{S}$, we need to show that there is such a robustly capable agent which goal misgeneralises on new distributions.

$\mathcal{M}^{\ast}$ being goal-environment ambiguous on $\mathcal{S}$ between the training objective $U$ and $\tilde{U}$ means (by [Definition 7](https://arxiv.org/html/2606.12268#Thmdefinition7)) that, for each $\bm{\sigma} \in \mathcal{S}$, $U$ and $\tilde{U}$ induce the same set of optimal policies in $\mathcal{M}^{\ast}$.

$\tilde{U}$ being substantially divergent from $U$ means that there is $\bm{\tilde{\sigma}} \in \bm{I}_{\mathcal{M}^{\ast}} \setminus \mathcal{S}$ such that on $\bm{\tilde{\sigma}}$, all policies which are optimal w.r.t. $\tilde{U}$ are very suboptimal w.r.t. $U$.

Let $\Gamma_{M}$ be an agent that is robustly capable w.r.t. $\tilde{U}$ on the entirety of $\bm{I}_{\mathcal{M}^{\ast}}$ and has a correct (up to the utility function) subjective model $\mathcal{M}^{A}$ (robust capability all but guarantees the latter if [Definitions 16](https://arxiv.org/html/2606.12268#Thmdefinition16) and [15](https://arxiv.org/html/2606.12268#Thmdefinition15) are satisfied, cf. [Theorem 7.1](https://arxiv.org/html/2606.12268#S7.Thmtheorem1); but even if they are not, such an agent will exist).

By goal-environment ambiguity between $U$ and $\tilde{U}$ on $\mathcal{S}$, for each $\bm{\sigma} \in \mathcal{S}$, if $\Gamma_{M}(\bm{\sigma})$ (that is, the policy $\Gamma_{M}$ outputs on $\bm{\sigma}$) is optimal w.r.t. $\tilde{U}$, it is also optimal w.r.t. $U$.

By robust capability w.r.t. $\tilde{U}$, the antecedent holds for every $\bm{\sigma} \in \mathcal{S}$, so $\Gamma_{M}$ outputs optimal policies w.r.t. $U$ on all of $\mathcal{S}$.

This means ([Definition 29](https://arxiv.org/html/2606.12268#Thmdefinition29)) that $\Gamma_{M}$ is robustly capable on $\mathcal{S}$ w.r.t. $U$.

But by substantial divergence, $\Gamma_{M}$ is very suboptimal w.r.t. $U$ on $\bm{\tilde{\sigma}}$.

That is, $\Gamma_{M}$ goal misgeneralises on $\bm{\tilde{\sigma}}$.

∎
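The structure of this argument can be illustrated with a toy instance. The sketch below is an assumption-laden simplification, not the paper's formal setting: distributions and policies are plain labels, expected utilities are given as hypothetical lookup tables, and `gamma_m` plays the role of $\Gamma_{M}$ by optimising $\tilde{U}$ everywhere.

```python
def optimal_policies(utility, sigma, policies):
    """Set of policies maximising the given utility on distribution sigma."""
    best = max(utility[sigma][pi] for pi in policies)
    return {pi for pi in policies if utility[sigma][pi] == best}


def robustly_capable(agent, utility, distributions, policies):
    """Definition 29: the agent outputs an optimal policy on every distribution."""
    return all(
        agent(sigma) in optimal_policies(utility, sigma, policies)
        for sigma in distributions
    )


policies = ["pi_a", "pi_b"]
training = ["sigma_1", "sigma_2"]                # the set S
all_distributions = training + ["sigma_tilde"]   # S plus one unseen distribution

# Hypothetical expected utilities: U and U_tilde agree on S, diverge on sigma_tilde.
U = {
    "sigma_1": {"pi_a": 1.0, "pi_b": 0.0},
    "sigma_2": {"pi_a": 0.0, "pi_b": 1.0},
    "sigma_tilde": {"pi_a": 1.0, "pi_b": 0.0},
}
U_tilde = {
    "sigma_1": {"pi_a": 1.0, "pi_b": 0.0},
    "sigma_2": {"pi_a": 0.0, "pi_b": 1.0},
    "sigma_tilde": {"pi_a": 0.0, "pi_b": 1.0},
}


def gamma_m(sigma):
    """Agent that optimises U_tilde on every distribution."""
    return max(policies, key=lambda pi: U_tilde[sigma][pi])


print(robustly_capable(gamma_m, U, training, policies))           # capable on S w.r.t. U
print(gamma_m("sigma_tilde") in optimal_policies(U, "sigma_tilde", policies))
```

On the training set the agent is indistinguishable from one that optimises $U$, yet on the unseen distribution it picks a policy that is suboptimal for $U$.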

#### 7.9.2 [Theorem 5.1](https://arxiv.org/html/2606.12268#S5.Thmtheorem1)

We model the evaluator by a node $E \in \bm{V}$. $E$ is a parent node of the agent's utility node $U$: $E$ decides what the correct answer to a question would have been based on $\mathbf{Pa}^{E}$, and $U(D, E) = 1$ if $E = D$, and $0$ otherwise.

###### Definition 32 (Learnable Mistakes).

For a CID $\mathcal{M}^{\ast}$ and distributions $\mathcal{S}$, an evaluator node $E \in \bm{V}$ makes *learnable mistakes* from the perspective of the decision node $D$ if there is a distribution $\bm{\sigma} \in \mathcal{S}$ and a variable $Y \in \bm{V}$ s.t.

$$
\arg\max_{y \in dom(Y)} P_{\mathcal{M}^{\ast}_{\bm{\sigma}}}(E = y \mid Q = \text{``Y''}, \mathbf{Pa}^{D}) \neq \arg\max_{y \in dom(Y)} P_{\mathcal{M}^{\ast}_{\bm{\sigma}}}(Y = y \mid Q = \text{``Y''}, \mathbf{Pa}^{D}).
$$

That is, the true causal structure $\mathcal{M}^{\ast}$ (which includes the evaluation mechanism $E$) is such that, from the perspective of the agent, its best guess about the truth differs from its best guess about what the evaluator will take the truth to be.

(Note that if $E$ is infallible, i.e. it robustly tracks all the nodes about which questions can be asked, meaning that $E$'s structural function always outputs the true value of the variable which the question indicated, this cannot happen, since $Pr_{\mathcal{M}^{\ast}}(Y)$ will always be equal to $Pr_{\mathcal{M}^{\ast}}(E \mid Q = \text{``Y''})$).

We now restate [Theorem 5.1](https://arxiv.org/html/2606.12268#S5.Thmtheorem1) more explicitly:

###### Theorem 7.4.

For a training strategy $\mathcal{T}$ on a domain dependent, unmediated environment $\mathcal{M}^{\ast}$ containing evaluator node $E$ and training distributions $\mathcal{S}$: If $E$ makes learnable mistakes on $\mathcal{S}$ then every robustly capable agent on some $\bm{\sigma} \in \mathcal{S}$ w.r.t. the training utility (given by [Eq. 1](https://arxiv.org/html/2606.12268#S5.E1)) is dishonest; that is, it is an evaluation simulator.

###### Proof.

Let $\Gamma$ be a robustly capable agent in a CID $\mathcal{M}^{\ast}$ with evaluator node $E$ such that $E$ makes learnable mistakes.

Let $Y$ and $y_{E}$ be as in [Definition 32](https://arxiv.org/html/2606.12268#Thmdefinition32), i.e. when asked about $Y$,

$$
y_{E} = \arg\max_{y \in dom(Y)} P_{\mathcal{M}^{\ast}_{\bm{\sigma}}}(E = y \mid Q = \text{``Y''}, \mathbf{Pa}^{D})
\quad \text{but} \quad
y_{E} \neq y^{\ast} = \arg\max_{y \in dom(Y)} P_{\mathcal{M}^{\ast}_{\bm{\sigma}}}(Y = y \mid Q = \text{``Y''}, \mathbf{Pa}^{D}).
$$

However, since the agent's utility depends on what the evaluator says, and specifically is maximised if its own answer matches the evaluator's, its optimal policy will answer $y_{E}$ instead of $y^{\ast}$, which is a dishonest answer.

Since $\Gamma$ is robustly capable, it outputs the optimal policy. Thus, its answer is $y_{E}$, not $y^{\ast}$. By [Theorem 7.3](https://arxiv.org/html/2606.12268#S7.Thmtheorem3), this is not what it believes. It is a dishonest evaluation simulator. ∎
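A small numerical instance makes the mechanism concrete. Everything in the sketch below is hypothetical: the two posteriors are invented numbers for a single question and a single observation $\mathbf{Pa}^{D}$, and the labels `"intact"` and `"stolen"` are placeholders for the values of $Y$. The only ingredient taken from the text is the training utility $U(D, E) = 1$ if $D = E$ and $0$ otherwise.

```python
def argmax(dist):
    """Most probable value under a finite distribution given as a dict."""
    return max(dist, key=dist.get)


# Hypothetical posteriors given the agent's observations Pa^D and Q = "Y".
posterior_truth = {"intact": 0.2, "stolen": 0.8}      # Pr(Y = y | Q, Pa^D)
posterior_evaluator = {"intact": 0.9, "stolen": 0.1}  # Pr(E = y | Q, Pa^D)

y_star = argmax(posterior_truth)   # what the agent believes to be true
y_E = argmax(posterior_evaluator)  # what the agent expects the evaluator to say

# U(D, E) = 1 if D == E and 0 otherwise, so E[U | D = d] = Pr(E = d | Q, Pa^D).
expected_utility = {d: posterior_evaluator[d] for d in posterior_evaluator}
optimal_answer = argmax(expected_utility)

print(y_star, y_E, optimal_answer)
```

Because the expected training utility of an answer is exactly the probability that the evaluator gives that answer, the utility-maximising answer coincides with $y_{E}$ whenever the two modes differ.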

#### 7.9.3 [Theorem 5.2](https://arxiv.org/html/2606.12268#S5.Thmtheorem2)

###### Proof.

We show that there are two substantially divergent utility functions $U$ and $\tilde{U}$ such that the training environment is ambiguous ([Definition 7](https://arxiv.org/html/2606.12268#Thmdefinition7)) between them, and that there is a $\bm{\tilde{\sigma}} \notin \mathcal{S}$ such that the optimal policy w.r.t. $U$ is to be an honest agent, but the optimal policy w.r.t. $\tilde{U}$ is an evaluation simulator.

Then [Theorem 5.2](https://arxiv.org/html/2606.12268#S5.Thmtheorem2) follows from [Lemma 1](https://arxiv.org/html/2606.12268#Thmlemma1).

If the evaluator never makes mistakes in the training environment, that is, if $\arg\max_{y} Pr_{\mathcal{M}^{\ast}_{\bm{\sigma}}}(Y = y \mid Q = \text{``Y''}, \mathbf{Pa}^{D})$ is always equal to $\arg\max_{y} Pr_{\mathcal{M}^{\ast}_{\bm{\sigma}}}(E = y \mid Q = \text{``Y''}, \mathbf{Pa}^{D})$ for all $\bm{\sigma} \in \mathcal{S}$, then $\mathcal{M}^{\ast}$ with distributions $\mathcal{S}$ is ambiguous between the following two utility functions:

$$
\begin{aligned}
U(D, E, Q) &= \begin{cases}
1 & \text{if } Q = \text{``Y''},\ D = y^{\ast}, \text{ and for all } \hat{y}: \\
  & Pr_{\mathcal{M}^{\ast}}(Y = y^{\ast} \mid \mathbf{Pa}^{D}) \geq Pr_{\mathcal{M}^{\ast}}(Y = \hat{y} \mid \mathbf{Pa}^{D}), \\
0 & \text{otherwise.}
\end{cases} \\
\tilde{U}(D, E, Q) &= \begin{cases}
1 & \text{if } D = E \\
0 & \text{otherwise.}
\end{cases}
\end{aligned}
$$

That is, if the training objective corresponds to the judgments of a (perfect) evaluator, then the environment is ambiguous between two utility functions: the one which rewards saying what is true, and the one which rewards saying what the evaluator believes to be true.

However, by assumption, there is $\tilde{\bm{\sigma}} \in \bm{I}_{\mathcal{M}^{\ast}} \setminus \mathcal{S}$ such that on $\tilde{\bm{\sigma}}$, the evaluator does make mistakes. That is, there will be $Y \in \bm{V}$ and values $y^{\ast}$, $y_{E}$, such that

$$
y_{E} = \arg\max_{y \in dom(Y)} P_{\mathcal{M}^{\ast}_{\tilde{\bm{\sigma}}}}(E = y \mid Q = \text{``Y''}, \mathbf{Pa}^{D}) \neq y^{\ast} = \arg\max_{y \in dom(Y)} P_{\mathcal{M}^{\ast}_{\tilde{\bm{\sigma}}}}(Y = y \mid Q = \text{``Y''}, \mathbf{Pa}^{D}).
$$

So, on $\tilde{\bm{\sigma}}$, answering $y^{\ast}$ to $Q = \text{``Y''}$ will give $U = 1$, whereas answering $y_{E}$ will give $U = 0$; with the situation being reversed for $\tilde{U}$. Thus, $U$ and $\tilde{U}$ diverge substantially; and the optimal policy w.r.t. $U$ is to be an honest agent, whereas the optimal policy w.r.t. $\tilde{U}$ is an evaluation simulator. ∎
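The two utility functions can be written out directly to see where they agree and where they part ways. The sketch below is a simplification under stated assumptions: the question $Q$ is fixed and omitted, the evaluator's verdict $E$ is treated as a known value rather than a random variable, and the posterior and the two scenarios are hypothetical.

```python
def honest_utility(answer, truth_posterior):
    """U: 1 if the answer is a most probable value of Y given Pa^D, else 0."""
    return 1 if truth_posterior[answer] == max(truth_posterior.values()) else 0


def simulator_utility(answer, evaluator_verdict):
    """U-tilde: 1 if the answer D equals the evaluator's verdict E, else 0."""
    return 1 if answer == evaluator_verdict else 0


truth_posterior = {"yes": 0.9, "no": 0.1}   # hypothetical Pr(Y = y | Pa^D)
scenarios = {
    "training": "yes",      # evaluator is correct, as on every sigma in S
    "sigma_tilde": "no",    # evaluator is mistaken on the unseen distribution
}

for name, verdict in scenarios.items():
    for answer in truth_posterior:
        print(name, answer,
              honest_utility(answer, truth_posterior),
              simulator_utility(answer, verdict))
```

In the training scenario both functions assign the same value to every answer, so no behavioural feedback can separate them; in the second scenario they assign opposite values to the two answers.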
