Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

arXiv cs.LG Papers

Summary

This paper formalizes the concept of signed compression progress on a sealed audit as a reward that is Goodhart-resistant, proving that cumulative reward telescopes to genuine audit improvement and providing bounds for finite audit panels. It identifies failure modes and validates results with experiments.

arXiv:2606.11417v1 Announce Type: new

# Signed Compression Progress on a Sealed Audit is Goodhart-Resistant
Source: [https://arxiv.org/html/2606.11417](https://arxiv.org/html/2606.11417)
###### Abstract

Compression progress is a long\-standing proposal for intrinsic motivation: reward an agent when its world model becomes better at predicting or compressing experience\. The folk claim is that this reward is “credible” because it is paid only for learning\. We make this precise and prove it\. If intrinsic reward is the signed decrease of a fixed sealed\-audit loss,

$$r_t^{\rm audit}=\mathcal{E}(\theta_{t-1})-\mathcal{E}(\theta_t),$$

then cumulative reward telescopes exactly to endpoint audit improvement. Consequently no policy can drive reward upward indefinitely while true audit performance stagnates or degrades. For finite audit panels, the same result holds with a sharp false-positive budget: cumulative empirical reward is at most true audit improvement plus $2\Delta_n(\mathcal{F},\delta)$, where $\Delta_n$ is the uniform audit deviation of the model class. This is horizon-free: adaptivity over time costs nothing once the sealed panel uniformly controls the class.

The theorem also identifies the failure modes. The guarantee disappears if progress is clipped, if progress is scored on the agent’s own stream, if a reusable finite panel is exposed to a high-capacity model, or if a neural class makes $\Delta_n$ vacuous. We provide a Lean 4 mechanization of the structural core (telescoping, finite-audit Goodhart resistance conditional on uniform deviation, finite Gibbs nonnegativity, and the entropy-floor budget) and an experiment suite on ARC-TGI grid-transformation generators plus adaptive holdout attacks. The experiments confirm the theory: finite-audit deviation scales as $n^{-0.527}$; signed progress resists clip-farming, stream leakage, and noisy-TV curiosity; naive reusable audits are exploitable by black-box scalar feedback, while fresh subsampling, laddering, rounding, and one-shot release keep the attack below the $2\Delta_n$ threshold. These results delimit when compression progress is Goodhart-resistant: *signed compression progress on a sealed audit is an accounting signal of genuine improvement.*

## 1 Introduction

Intrinsic motivation based on prediction or compression progress appears in Schmidhuber’s work on artificial curiosity and the later compression\-progress theory of interestingness\[[8](https://arxiv.org/html/2606.11417#bib.bib8),[9](https://arxiv.org/html/2606.11417#bib.bib9),[10](https://arxiv.org/html/2606.11417#bib.bib10)\]\. Reward is paid only when the agent’s model improves\. This distinguishes learnable regularity from incompressible noise and should avoid noisy\-TV pathologies that trap raw prediction\-error bonuses\.

But the informal statement is too broad\. A learning agent can improve on its own recently selected stream while becoming worse on the target distribution\. It can forget and relearn the same facts if reward clips away negative progress\. It can overfit a finite validation set if repeated scalar feedback leaks information\. It can exploit a high\-capacity model class until a nominal holdout is no longer a holdout\. These are the Goodhart channels that matter for intrinsic rewards in continual learning and recursive self\-improvement\.

We isolate the representation under which the compression-progress claim becomes true. Let $\mathsf{Q}$ be a fixed audit distribution and let $\mathcal{E}(\theta)=\mathbb{E}_{z\sim\mathsf{Q}}\,\ell(\theta,z)$ be audit log-loss or any lower-bounded proper scoring loss. Define signed audit compression progress by

$$r_t^{\rm audit}=\mathcal{E}(\theta_{t-1})-\mathcal{E}(\theta_t). \tag{1}$$

Then the entire reward history is an endpoint identity:

$$\sum_{t=1}^{T} r_t^{\rm audit}=\mathcal{E}(\theta_0)-\mathcal{E}(\theta_T). \tag{2}$$

Thus any apparent long-run reward must be paid for by a genuine reduction in audit loss. Goodhart resistance here is a property of the measurement frame: it holds because progress is scored against a fixed audit loss.

#### Contributions\.

We make four contributions. First, we define *budgeted Goodhart resistance*: a progress signal is credible up to a finite false-positive budget $\Gamma$ if cumulative reward cannot exceed true audit improvement by more than $\Gamma$. Exact sealed-audit compression progress has $\Gamma=0$; finite panels have $\Gamma=2\Delta_n$. Second, we mechanize the structural core in Lean 4: exact telescoping, finite-audit Goodhart resistance under a uniform-deviation event, finite Gibbs nonnegativity, and an entropy-floor theorem for incompressible components. Third, we separate the reward signal from the scheduler: audit compression progress supplies the credible reward, while multiplicative weights / EXP3 provides allocation. Fourth, we run a focused experiment suite using ARC-TGI task generators, RND \[[3](https://arxiv.org/html/2606.11417#bib.bib3)\], ICM \[[7](https://arxiv.org/html/2606.11417#bib.bib7)\], prediction-error curiosity, finite-audit concentration checks, stream leakage, clipping cycles, reusable-panel memorization, and black-box scalar-feedback holdout attacks.

![Refer to caption](https://arxiv.org/html/2606.11417v1/x1.png)

Figure 1: The measurement frame. Training data may be selected adaptively, but reward is computed only from the signed change in a fixed audit loss. This makes intrinsic reward an endpoint accounting identity over the sealed audit.

## 2 Related work

#### Goodhart’s law and reward hacking\.

Optimizing a proxy until it diverges from the target it stands for is Goodhart’s law\[[14](https://arxiv.org/html/2606.11417#bib.bib14)\], and its learned\-agent form is reward hacking or specification gaming\[[11](https://arxiv.org/html/2606.11417#bib.bib11),[13](https://arxiv.org/html/2606.11417#bib.bib13),[18](https://arxiv.org/html/2606.11417#bib.bib18),[15](https://arxiv.org/html/2606.11417#bib.bib15)\]\. Most of this literature characterizes when a fixed proxy is unsafe to optimize\. We hold the measurement frame fixed and ask a quantitative question: when can a progress signal exceed true improvement, and by how much? Budgeted Goodhart resistance answers this for compression\-progress rewards with a finite false\-positive budget\.

#### Intrinsic motivation and compression progress\.

Compression progress as a driver of curiosity and creativity originates with Schmidhuber \[[8](https://arxiv.org/html/2606.11417#bib.bib8),[9](https://arxiv.org/html/2606.11417#bib.bib9),[10](https://arxiv.org/html/2606.11417#bib.bib10)\]. Prediction-error and feature-prediction bonuses \[[7](https://arxiv.org/html/2606.11417#bib.bib7)\] and random network distillation \[[3](https://arxiv.org/html/2606.11417#bib.bib3)\] are the standard deep reinforcement learning realizations. These bonuses score error or novelty on the agent’s own stream. Audit compression progress scores signed error reduction on a sealed distribution, which is what produces the endpoint accounting identity and the entropy floor.

#### Adaptive data analysis and holdout reuse\.

Repeated queries against a finite validation set erode its guarantees; the reusable holdout and the Ladder mechanism bound this erosion \[[5](https://arxiv.org/html/2606.11417#bib.bib5),[2](https://arxiv.org/html/2606.11417#bib.bib2)\], and empirical studies measure it on real leaderboards and benchmarks \[[16](https://arxiv.org/html/2606.11417#bib.bib16),[17](https://arxiv.org/html/2606.11417#bib.bib17)\]. The finite-panel budget $2\Delta_n$ is the audit-CP form of the same phenomenon, and our adaptive scalar-feedback attack instantiates it together with the standard release defenses.

#### Proper scoring rules\.

Log-loss is a strictly proper scoring rule \[[12](https://arxiv.org/html/2606.11417#bib.bib12)\], so its population minimizer is the true conditional distribution. The entropy floor (Theorem [3](https://arxiv.org/html/2606.11417#Thmtheorem3)) is the statement that this minimum equals the conditional entropy, which is why a purely random component carries only a finite compression-progress budget. The calibration probe separates this proper-scoring signal from hard accuracy.

## 3 Setup: audit compression progress

Let $\Theta$ be a class of model states and let $\ell:\Theta\times\mathcal{Z}\to\mathbb{R}$ be a bounded or lower-bounded predictive loss. In the log-loss case, $\ell(\theta,(x,y))=-\log p_\theta(y\mid x)$. For experiments we use probability-floored cross-entropy,

$$\ell_\varepsilon(\theta,(x,y))=-\log\max\{\varepsilon,\,p_\theta(y\mid x)\}, \tag{3}$$

which is bounded by $R=-\log\varepsilon$.
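As a minimal sketch of the floored loss in equation (3), the following computes the per-example loss and its cap $R=-\log\varepsilon$. The function names `floored_log_loss` and `loss_cap` are illustrative assumptions, not identifiers from the paper's artifact.

```python
import math


def floored_log_loss(p_true: float, eps: float = 0.01) -> float:
    """Probability-floored cross-entropy of one example: -log(max(eps, p_true)).

    p_true is the probability the model assigns to the observed label.
    """
    if not 0.0 <= p_true <= 1.0:
        raise ValueError("p_true must be a probability in [0, 1]")
    if not 0.0 < eps <= 1.0:
        raise ValueError("eps must lie in (0, 1]")
    return -math.log(max(eps, p_true))


def loss_cap(eps: float = 0.01) -> float:
    """Upper bound R = -log(eps) on the floored loss."""
    return -math.log(eps)
```

With $\varepsilon=0.01$ the cap evaluates to roughly 4.6, so a model that assigns zero probability to the observed label is charged the cap instead of an infinite loss.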

###### Definition 1\(Sealed audit loss\)\.

A sealed audit distribution $\mathsf{Q}$ is fixed independently of the agent’s adaptive training trajectory and cannot be selected, distorted, or inspected by the agent except through the permitted audit-release mechanism. The population audit loss is

$$\mathcal{E}(\theta)=\mathbb{E}_{z\sim\mathsf{Q}}\,\ell(\theta,z). \tag{4}$$

For a finite audit panel $A_n=(z_1,\ldots,z_n)$, the empirical audit loss is

$$\widehat{\mathcal{E}}_n(\theta)=\frac{1}{n}\sum_{i=1}^{n}\ell(\theta,z_i). \tag{5}$$

###### Definition 2\(Signed audit compression progress\)\.

Given a trajectory $\theta_0,\theta_1,\ldots$, signed audit compression progress is

$$r_t^{\mathrm{CP}}=\mathcal{E}(\theta_{t-1})-\mathcal{E}(\theta_t),\qquad \hat{r}_t^{\mathrm{CP}}=\widehat{\mathcal{E}}_n(\theta_{t-1})-\widehat{\mathcal{E}}_n(\theta_t). \tag{6}$$

The sign is part of the definition: negative progress is charged back to the agent.

###### Definition 3\(False\-positive budget\)\.

For a reward signal $r_t$ and true audit loss $\mathcal{E}$, define the reward excess at horizon $T$ by

$$\Gamma_T(r,\mathcal{E})=\sum_{t=1}^{T}r_t-\big(\mathcal{E}(\theta_0)-\mathcal{E}(\theta_T)\big). \tag{7}$$

A signal is $\Gamma$-Goodhart-resistant on a class of trajectories if $\Gamma_T(r,\mathcal{E})\leq\Gamma$ for every horizon $T$ and every admissible trajectory in the class. Exact audit-CP has $\Gamma=0$; finite-panel audit-CP has $\Gamma=2\Delta_n$ on the uniform-deviation event.

This condition is stronger than correlation with learning: it caps the apparent reward obtainable without true audit improvement\.

## 4 Theorems

### 4\.1 Exact sealed audits: zero false\-positive budget

###### Theorem 1\(Exact\-audit telescoping and finite budget\)\.

Let $\mathcal{E}:\Theta\to\mathbb{R}$ and let $\theta_t$ be any trajectory. Define $r_t=\mathcal{E}(\theta_{t-1})-\mathcal{E}(\theta_t)$. Then for every horizon $T$,

$$\sum_{t=1}^{T}r_t=\mathcal{E}(\theta_0)-\mathcal{E}(\theta_T). \tag{8}$$

If $\mathcal{E}(\theta_T)\geq E_{\min}$, then

$$\sum_{t=1}^{T}r_t\leq\mathcal{E}(\theta_0)-E_{\min}. \tag{9}$$

Thus no policy can make cumulative signed audit progress diverge while audit loss stagnates or remains lower-bounded.

###### Proof\.

The sum telescopes:

$$\sum_{t=1}^{T}\big(\mathcal{E}(\theta_{t-1})-\mathcal{E}(\theta_t)\big)=\mathcal{E}(\theta_0)-\mathcal{E}(\theta_T).$$

The lower-bound statement follows immediately. This proof is mechanized as `cumCP_telescope`; the finite-budget form is `cumCP_le_of_lb`. ∎

Each hypothesis of Theorem [1](https://arxiv.org/html/2606.11417#Thmtheorem1) is necessary: a fixed audit loss provides a single potential to telescope, signed accounting lets negative terms cancel, and a lower bound yields a finite budget.
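The telescoping identity of Theorem 1 can be checked numerically on an arbitrary, non-monotone loss trajectory. The sketch below is only an illustration: the helper name `signed_rewards` and the random trajectory are assumptions, and the losses stand in for $\mathcal{E}(\theta_0),\ldots,\mathcal{E}(\theta_T)$.

```python
import math
import random


def signed_rewards(losses):
    """Signed progress r_t = E(theta_{t-1}) - E(theta_t) for t = 1..T."""
    return [prev - cur for prev, cur in zip(losses, losses[1:])]


# Hypothetical audit-loss trajectory: 1001 states, so T = 1000 rewards.
rng = random.Random(0)
losses = [rng.uniform(0.0, 4.6) for _ in range(1001)]

cumulative_reward = math.fsum(signed_rewards(losses))
endpoint_improvement = losses[0] - losses[-1]

# However the trajectory wanders, the sum collapses to the endpoint difference.
assert math.isclose(cumulative_reward, endpoint_improvement, abs_tol=1e-9)
```

Because every intermediate loss appears once with each sign, only the first and last states survive in the sum, whatever the agent did in between.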

### 4.2 Finite audits: the $2\Delta_n$ false-positive budget

A reusable finite audit panel is not automatically sealed in the population sense; the relevant condition is a uniform\-deviation event\.

###### Definition 4\(Uniform audit deviation\)\.

For a class $\mathcal{F}\subseteq\Theta$, define

$$\Delta_n(\mathcal{F})=\sup_{\theta\in\mathcal{F}}\big|\widehat{\mathcal{E}}_n(\theta)-\mathcal{E}(\theta)\big|. \tag{10}$$

We say the panel realizes deviation $\Delta$ on $\mathcal{F}$ if $\Delta_n(\mathcal{F})\leq\Delta$.

###### Theorem 2\(Finite\-audit Goodhart resistance\)\.

Assume $\Delta_n(\mathcal{F})\leq\Delta$ and $\theta_t\in\mathcal{F}$ for all $t\leq T$. Then

$$\sum_{t=1}^{T}\hat{r}_t^{\mathrm{CP}}\leq\sum_{t=1}^{T}r_t^{\mathrm{CP}}+2\Delta=\mathcal{E}(\theta_0)-\mathcal{E}(\theta_T)+2\Delta. \tag{11}$$

Equivalently, empirical audit-CP has false-positive budget at most $2\Delta$.

###### Proof\.

By telescoping,

$$\sum_t\hat{r}_t=\widehat{\mathcal{E}}_n(\theta_0)-\widehat{\mathcal{E}}_n(\theta_T),\qquad\sum_t r_t=\mathcal{E}(\theta_0)-\mathcal{E}(\theta_T).$$

Uniform deviation controls only the two endpoints:

$$\widehat{\mathcal{E}}_n(\theta_0)\leq\mathcal{E}(\theta_0)+\Delta,\qquad\widehat{\mathcal{E}}_n(\theta_T)\geq\mathcal{E}(\theta_T)-\Delta.$$

Combining gives the result. This is mechanized as `finite_audit_goodhart`. ∎

There is no union bound overTT: after signed telescoping, the adaptive history reduces to endpoint control\. The cost of adaptivity appears in proving that the panel realizes a uniform\-deviation event for the reachable class; the theorem itself is horizon\-free\.
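To make the horizon-free bound of Theorem 2 concrete, here is a small simulation under assumed toy conditions: the model class is a grid of 21 Bernoulli predictors, the audit distribution is Bernoulli with parameter `q`, and the trajectory is drawn at random from the class. None of these choices come from the paper's experiments; they only illustrate that the realized deviation over the class controls the cumulative empirical reward at any horizon.

```python
import math
import random

EPS = 0.01  # probability floor (assumed, matching the floored loss in Section 3)


def loss(p: float, y: int) -> float:
    """Floored log-loss of a model predicting P(y = 1) = p on outcome y."""
    prob = p if y == 1 else 1.0 - p
    return -math.log(max(EPS, prob))


def population_loss(p: float, q: float) -> float:
    """Exact audit loss under a Bernoulli(q) audit distribution."""
    return q * loss(p, 1) + (1.0 - q) * loss(p, 0)


def empirical_loss(p: float, panel) -> float:
    """Average loss on a finite audit panel."""
    return sum(loss(p, y) for y in panel) / len(panel)


rng = random.Random(1)
q = 0.7
panel = [1 if rng.random() < q else 0 for _ in range(200)]
model_class = [i / 20 for i in range(21)]

# Realized uniform deviation Delta_n(F) over the finite class.
delta = max(abs(empirical_loss(p, panel) - population_loss(p, q)) for p in model_class)

# Arbitrary (here random) trajectory that stays inside the class, T = 500.
trajectory = [rng.choice(model_class) for _ in range(501)]
steps = list(zip(trajectory, trajectory[1:]))
empirical_total = sum(empirical_loss(a, panel) - empirical_loss(b, panel) for a, b in steps)
true_total = sum(population_loss(a, q) - population_loss(b, q) for a, b in steps)

# Theorem 2: empirical cumulative reward exceeds true improvement by at most 2 * Delta.
assert empirical_total <= true_total + 2 * delta + 1e-9
```

The trajectory length plays no role in the check: lengthening it changes neither `delta` nor the form of the inequality, which is the sense in which the bound has no union over $T$.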

###### Corollary 1\(Finite experts\)\.

If $|\mathcal{F}|=N$ and $\ell\in[0,R]$, then with probability at least $1-\delta$ over an i.i.d. audit panel,

$$\Delta_n(\mathcal{F})\leq R\sqrt{\frac{\log(2N/\delta)}{2n}}, \tag{12}$$

so

$$\sum_{t=1}^{T}\hat{r}_t^{\mathrm{CP}}\leq\mathcal{E}(\theta_0)-\mathcal{E}(\theta_T)+2R\sqrt{\frac{\log(2N/\delta)}{2n}}. \tag{13}$$

###### Proof\.

Apply two-sided Hoeffding to each fixed model and union bound over $\mathcal{F}$; then invoke Theorem [2](https://arxiv.org/html/2606.11417#Thmtheorem2). The Lean artifact mechanizes the deterministic implication from a realized uniform-deviation event to Goodhart resistance; this probabilistic instantiation is the standard finite-class concentration corollary. ∎
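The bound in equation (12) is easy to evaluate for planning panel sizes. The sketch below assumes a hypothetical helper name, `hoeffding_uniform_deviation`, and example values of $N$ and $\delta$ that are not taken from the paper.

```python
import math


def hoeffding_uniform_deviation(R: float, N: int, n: int, delta: float) -> float:
    """Right-hand side of equation (12): R * sqrt(log(2N / delta) / (2n))."""
    if N < 1 or n < 1 or not 0.0 < delta < 1.0 or R <= 0.0:
        raise ValueError("need N >= 1, n >= 1, 0 < delta < 1, R > 0")
    return R * math.sqrt(math.log(2 * N / delta) / (2 * n))


# Illustrative values only: cap R = -log(0.01), 100 models, 95% confidence.
R = -math.log(0.01)
for n in (128, 512, 2048):
    bound = hoeffding_uniform_deviation(R, N=100, n=n, delta=0.05)
    print(f"n={n:5d}  Delta_n <= {bound:.3f}  false-positive budget 2*Delta_n <= {2 * bound:.3f}")
```

Quadrupling the panel size halves the bound, while the class size $N$ enters only logarithmically.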

For infinite classes, replace $N$ by the appropriate covering number, Rademacher complexity, or PAC-Bayesian radius. In particular, bounded linear balls and bounded RKHS balls produce the same form: a finite false-positive budget whenever the effective audit capacity is finite. General neural networks enter the theory only through their effective class size or stability. If the reachable class is large enough to memorize the reusable audit panel, the bound becomes vacuous; this delimits where finite-audit resistance ceases to hold.

### 4\.3 Entropy floor: why noisy TV cannot pay forever

Prediction\-error bonuses pay for error itself and are therefore attracted to irreducible noise\. Compression progress pays for error reduction\. For log\-loss, the distinction is formalized by the following entropy floor\.

###### Theorem 3\(Entropy floor\)\.

Let $S$ be an audit component with conditional distribution $\mathsf{Q}_S(Y\mid X)$ and log-loss risk $\mathcal{E}_S(\theta)$. If $\mathcal{E}_S(\theta)\geq H_{\mathsf{Q}_S}(Y\mid X)$ for all $\theta$, then signed compression progress on $S$ satisfies

$$\sum_{t=1}^{T}\big(\mathcal{E}_S(\theta_{t-1})-\mathcal{E}_S(\theta_t)\big)\leq\mathcal{E}_S(\theta_0)-H_{\mathsf{Q}_S}(Y\mid X). \tag{14}$$

A purely random component has only a finite improvement budget; once the model reaches the entropy floor, it cannot keep paying audit-CP.

###### Proof\.

For log-loss, $\mathcal{E}_S(\theta)=H_{\mathsf{Q}_S}(Y\mid X)+\mathrm{KL}(\mathsf{Q}_S\,\|\,p_\theta)$ and $\mathrm{KL}\geq 0$, hence $\mathcal{E}_S(\theta)\geq H_{\mathsf{Q}_S}(Y\mid X)$, and the budget follows from Theorem [1](https://arxiv.org/html/2606.11417#Thmtheorem1). This composition is mechanized end to end in the conditional, input-averaged form: the cross-entropy decomposition and the finite Gibbs inequality discharge the conditional entropy floor, which is then fed into the telescoping budget. The corresponding Lean declarations are listed in Table [4](https://arxiv.org/html/2606.11417#S8.T4); the single-input case specializes the same chain, and an abstract version takes the floor as a hypothesis for an arbitrary lower-bounded potential. ∎
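The decomposition used in this proof can be checked on a finite distribution. The sketch below works in the single-input case with an assumed uniform noise source over 10 symbols and a hypothetical trajectory of models that oscillates toward and away from the truth; the helper names are illustrative.

```python
import math


def entropy(q):
    """Shannon entropy H(q) in nats."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)


def cross_entropy(q, p):
    """Log-loss risk of model p under true distribution q."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)


def kl_divergence(q, p):
    """KL(q || p) for finite distributions with p > 0 wherever q > 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def mix(start, target, t):
    """Convex combination (1 - t) * start + t * target."""
    return [(1.0 - t) * s + t * u for s, u in zip(start, target)]


noise = [0.1] * 10                 # purely random component: uniform over 10 symbols
start = [0.55] + [0.05] * 9        # initial model, overconfident in symbol 0
schedule = [0.0, 0.5, 1.0, 0.25, 1.0, 0.75, 1.0]  # learn, forget, relearn

models = [mix(start, noise, t) for t in schedule]
losses = [cross_entropy(noise, p) for p in models]

cumulative_progress = sum(a - b for a, b in zip(losses, losses[1:]))
budget = losses[0] - entropy(noise)  # right-hand side of equation (14)

assert all(math.isclose(cross_entropy(noise, p), entropy(noise) + kl_divergence(noise, p)) for p in models)
assert cumulative_progress <= budget + 1e-9
```

Forgetting and relearning does not enlarge the budget: the signed sum is capped by the initial gap to the entropy floor no matter how often the trajectory revisits worse models.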

## 5 Where the theorem breaks

Each assumption above has a failure construction and a corresponding experiment\.

#### Clipping destroys accounting\.

If reward is $r_t^{+}=\max\{0,\mathcal{E}(\theta_{t-1})-\mathcal{E}(\theta_t)\}$, then a two-state cycle $a,b,a,b,\ldots$ with $\mathcal{E}(a)<\mathcal{E}(b)$ accumulates positive reward on every $b\to a$ transition while returning to the same endpoint every two steps. Thus $\sum_t r_t^{+}$ can grow linearly at zero net improvement. Signed progress cancels exactly.
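The two-state cycle can be written out directly. In this sketch the loss values $\mathcal{E}(a)=1$ and $\mathcal{E}(b)=2$ are arbitrary assumptions chosen so the arithmetic is exact.

```python
def cumulative_rewards(losses):
    """Return (signed, clipped) cumulative progress along a loss trajectory."""
    signed = 0.0
    clipped = 0.0
    for prev, cur in zip(losses, losses[1:]):
        progress = prev - cur
        signed += progress               # negative progress is charged back
        clipped += max(0.0, progress)    # negative progress is discarded
    return signed, clipped


E_a, E_b = 1.0, 2.0                      # hypothetical audit losses, E(a) < E(b)
cycle = [E_a, E_b] * 50 + [E_a]          # a, b, a, b, ..., a: 100 transitions
signed, clipped = cumulative_rewards(cycle)

print(signed)   # 0.0: the trajectory ends where it started
print(clipped)  # 50.0: one unit farmed on each of the 50 b -> a transitions
```

The clipped total grows by $\mathcal{E}(b)-\mathcal{E}(a)$ every two steps, so it is linear in the horizon, while the signed total stays pinned to the endpoint difference.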

#### Stream scoring destroys the fixed potential\.

If the agent is rewarded by $\mathcal{E}_{\mathsf{P}_t}(\theta_{t-1})-\mathcal{E}_{\mathsf{P}_t}(\theta_t)$ on its own selected stream $\mathsf{P}_t$, then there is no fixed $\mathcal{E}$ to telescope. A policy can select or distort streams where local loss decreases while sealed-audit loss does not.

#### High\-capacity reusable panels destroy uniform deviation\.

If $\mathcal{F}$ can interpolate the finite audit panel, a policy can drive $\widehat{\mathcal{E}}_n$ down while $\mathcal{E}$ stagnates or rises. This is ordinary adaptive holdout overfitting \[[5](https://arxiv.org/html/2606.11417#bib.bib5),[2](https://arxiv.org/html/2606.11417#bib.bib2)\], now expressed as false audit-CP. The $2\Delta_n$ theorem remains correct, but $\Delta_n$ is no longer small.

![Refer to caption](https://arxiv.org/html/2606.11417v1/x2.png)

Figure 2: Boundary tests from the experiment artifact. Stream scoring: stream-scored progress can exceed sealed-audit progress by roughly $40\times$. Clipping: clipped reward farms a positive cumulative signal while signed reward equals endpoint change. Panel memorization: direct reusable-panel training yields positive apparent progress and negative true progress. Calibration: temperature scaling changes log-loss/ECE at fixed hard accuracy, emphasizing that compression progress measures probabilistic prediction.

## 6 Algorithmic separation: reward signal versus scheduler

We distinguish the reward signal from the scheduler\.

#### Audit\-CP is the reward/accounting signal\.

For each candidate training sourcejj, temporarily train or estimate the update effect, evaluate the sealed\-audit loss before and after, and release the signed reward

$$r_{t,j}=\widehat{\mathcal{E}}_n(\theta_t)-\widehat{\mathcal{E}}_n(U_j(\theta_t)), \tag{15}$$

where $U_j$ denotes the update produced by source $j$. This reward is meaningful because Theorem [2](https://arxiv.org/html/2606.11417#Thmtheorem2) controls its cumulative false positives.

#### EXP3/MWU is the scheduler\.

We use multiplicative weights / EXP3 to allocate training among sources. With weights $w_{j,t}$ and exploration parameter $\gamma$,

$$p_{j,t}=(1-\gamma)\frac{w_{j,t}}{\sum_k w_{k,t}}+\frac{\gamma}{K},\qquad w_{j,t+1}=w_{j,t}\exp\left(\eta\,\frac{r_{t,j}\,\mathbf{1}\{j=j_t\}}{p_{j,t}}\right). \tag{16}$$

This is the standard adversarial-bandit use of multiplicative weights \[[1](https://arxiv.org/html/2606.11417#bib.bib1)\], providing exploration and exploitation. The credibility of the cumulative reward comes from the reward signal: if the payoff fed to MWU is a hackable stream reward, MWU efficiently optimizes the hack; if the payoff is signed audit-CP, Theorem [2](https://arxiv.org/html/2606.11417#Thmtheorem2) supplies the credibility certificate.

Audit-CP scheduler protocol. For $t=1,\ldots,T$: choose a source $j_t$ with an EXP3/MWU policy; train the model on fresh samples from source $j_t$; compute signed audit loss change on the sealed panel; update the scheduler using the signed audit-CP reward. Negative rewards are retained. The audit panel is never used as training data and the released signal is protected by the chosen audit-release mechanism.
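The scheduler update in equation (16) can be sketched in a few lines. The class name `Exp3Scheduler` and the default values of $\gamma$ and $\eta$ are assumptions for illustration; the section does not state the hyperparameters used in the experiments, and the training and audit steps of the protocol are left to the caller.

```python
import math
import random


class Exp3Scheduler:
    """EXP3-style allocator over K training sources, fed signed audit-CP rewards."""

    def __init__(self, num_sources: int, gamma: float = 0.1, eta: float = 0.05, seed: int = 0):
        self.num_sources = num_sources
        self.gamma = gamma              # exploration parameter (assumed value)
        self.eta = eta                  # learning rate (assumed value)
        self.weights = [1.0] * num_sources
        self.rng = random.Random(seed)

    def probabilities(self):
        """p_j = (1 - gamma) * w_j / sum_k w_k + gamma / K."""
        total = sum(self.weights)
        return [(1.0 - self.gamma) * w / total + self.gamma / self.num_sources
                for w in self.weights]

    def choose(self) -> int:
        """Sample the next training source j_t."""
        return self.rng.choices(range(self.num_sources), weights=self.probabilities(), k=1)[0]

    def update(self, chosen: int, signed_reward: float) -> None:
        """Importance-weighted update of the chosen source only.

        Negative rewards are retained, so a source that worsens the audit
        loss has its weight reduced.
        """
        p_chosen = self.probabilities()[chosen]
        self.weights[chosen] *= math.exp(self.eta * signed_reward / p_chosen)
```

A caller would loop over `choose()`, train on the selected source, measure the signed change in sealed-panel loss, and pass it to `update()`. The scheduler itself is agnostic to where the reward comes from, which is why the credibility argument rests on the reward signal and not on this class.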

## 7 Experiments

### 7\.1 Experimental substrate

All reported numbers come from the accompanying experiment artifact\. The curriculum substrate is ARC\-TGI, a generator framework for ARC\-style grid transformation tasks; its released generator inventory includes ARC\-Mini, ARC\-AGI\-1, and ARC\-AGI\-2 families\[[6](https://arxiv.org/html/2606.11417#bib.bib6)\]\. Curriculum experiments require repeated fresh samples from a stable latent rule; a single static puzzle instance is insufficient\. The broader ARC\-AGI\-2 benchmark is motivated by few\-shot abstraction and refinement\-loop behavior\[[4](https://arxiv.org/html/2606.11417#bib.bib4)\]\.

The experiment suite uses $30\times 30$ padded grids, 24 learnable ARC-TGI grid-to-grid families, 8 i.i.d. uniform distractor families, probability-floored cross-entropy with $\varepsilon=0.01$ and cap $R\approx 4.6$, audit panels of size 512 per family unless otherwise stated, 5000 training steps, and 20 seeds for every experiment except the adaptive attack. The adaptive scalar-feedback attack reports its single-cell power calibration and panel-size scaling curve over 20 seeds each; the full-scale grid attack, in which panel capacity far exceeds the query budget, uses 12 seeds. Baselines include prediction-error curiosity, real RND with a fixed random target network and learned predictor \[[3](https://arxiv.org/html/2606.11417#bib.bib3)\], real ICM with inverse-model features and forward prediction error \[[7](https://arxiv.org/html/2606.11417#bib.bib7)\], uniform sampling, round-robin, and an oracle that samples only learnable families.

### 7\.2 Main result table

Table 1: Experiment suite. Each row tests one theorem assumption: concentration, noisy-TV entropy, adaptive holdout release, signed accounting, stream/audit separation, capacity, and metric identity. Reported uncertainties are one standard deviation across the seeds (20 unless otherwise noted).

### 7\.3 Finite\-audit concentration

The finite-audit theorem is useful only if $\Delta_n$ is small for the reachable class. We measure empirical uniform deviation over finite model families and vary panel size.

![Refer to caption](https://arxiv.org/html/2606.11417v1/x3.png)

Figure 3: Finite-audit concentration: empirical uniform deviation of floored cross-entropy over a finite family decays as $\Delta_n\propto n^{-0.527}$, close to the $n^{-1/2}$ prediction.

The observed slope is $-0.527$ over 20 seeds, matching the predicted $n^{-1/2}$ rate; the Hoeffding constant itself is loose. The false-positive budget is therefore an empirically measurable quantity.
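A scaling exponent such as the one reported above is typically obtained by a least-squares fit in log-log space. The sketch below shows that fit on synthetic data that follows $n^{-1/2}$ exactly; the function name and the data are assumptions, and this is not necessarily the procedure used in the paper's artifact.

```python
import math


def fit_power_law_exponent(panel_sizes, deviations):
    """Least-squares slope of log(deviation) against log(panel size)."""
    xs = [math.log(n) for n in panel_sizes]
    ys = [math.log(d) for d in deviations]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic check: deviations that decay exactly as 3 * n^(-1/2).
sizes = [50, 100, 200, 400, 800, 1600]
synthetic = [3.0 * n ** -0.5 for n in sizes]
slope = fit_power_law_exponent(sizes, synthetic)
assert abs(slope + 0.5) < 1e-9
```

On measured deviations the fitted slope would carry seed-to-seed noise, so a value near but not equal to $-0.5$ is expected even when the $n^{-1/2}$ rate holds.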

### 7\.4 Reward\-signal ablation under noise distractors

The ablation holds the scheduler fixed and varies only the reward signal\. The win condition is downstream active\-cell reconstruction accuracy on a sealed audit panel, plus low allocation to i\.i\.d\. noise distractors\.

![Refer to caption](https://arxiv.org/html/2606.11417v1/x4.png)

Figure 4: Reward-signal ablation with the same scheduler. Final active-cell reconstruction accuracy (left axis) and distractor allocation fraction (right axis) per reward signal, 20 seeds. Audit-CP is the strongest non-oracle reward signal and spends less than half the distractor budget of prediction-error curiosity.

Table 2: Reward-signal ablation, final metrics, 20 seeds. Audit-CP beats prediction-error by $+0.0498$ and ICM by $+0.0407$ in 20/20 seeds, edges RND by $+0.0159$ in 17/20 seeds, and approaches the learnable-only oracle.

Prediction error is Goodharted by noise: it spends 50.4% of its sampling budget on distractors. Audit-CP spends 21.9%, remains above uniform and round-robin, and approaches the oracle. Audit-CP allocates toward tasks whose error compresses, independent of task difficulty.

![Refer to caption](https://arxiv.org/html/2606.11417v1/x5.png)

Figure 5: Sealed-panel learning curves for an audit-CP run. Audit NLL (left axis) falls, equivalently cumulative audit compression progress rises, while cell and active-cell reconstruction accuracy (right axis) improve over training. The signed audit-CP signal tracks downstream reconstruction. One-step log-loss changes differ from one-step hard-accuracy changes; the reward-signal ablation reports both metrics.

### 7\.5 Adaptive scalar\-feedback holdout overfitting

In the black\-box scalar\-feedback setting, the attacker sees only released audit scores and proposes adaptive model updates\. This is the holdout\-reuse problem applied to audit\-CP\.

![Refer to caption](https://arxiv.org/html/2606.11417v1/x6.png)

Figure 6: Adaptive scalar-feedback overfitting (single-cell calibration, 20 seeds; error bars are one standard deviation). Left: the attacker has power against naive reusable release, with gap $3.24\pm 0.07>2\Delta_n=1.64$ and a win in all 20 seeds. Fresh subsampling, laddering, rounding, and one-shot release keep the gap below threshold. Right: the gap falls below threshold once panel capacity exceeds the query budget, crossing in $20/20$ seeds at $n\leq 800$ and in $0/20$ at $n\geq 3200$.

Table 3: Holdout-attack defense calibration at $Q=8000$, $n=200$, single-cell attack, 20 seeds. The overfit gap is reusable minus fresh audit-CP; the attacker wins when the gap exceeds the finite-audit threshold $2\Delta_n$. Naive reuse is exploitable in all 20 seeds; the standard release defenses suppress the scalar-feedback attack below threshold.

This attack maps the boundary of finite audit-CP: reusing a panel and releasing too much information lets an adaptive agent overfit it, whereas defenses that reduce information leakage keep the scalar-feedback gap below the validity threshold. The $2\Delta_n$ threshold is therefore operational.
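To illustrate how a release defense limits scalar leakage, here is a minimal sketch in the spirit of the Ladder mechanism cited in the related work: a new score is released only when it beats the previous best by more than a step, and released values are rounded to that step. The class name, the step size, and the exact rule are assumptions; the section does not specify how the paper's laddering and rounding defenses are implemented.

```python
class LadderRelease:
    """Ladder-style release of audit losses (illustrative sketch, not the paper's code)."""

    def __init__(self, step: float):
        if step <= 0.0:
            raise ValueError("step must be positive")
        self.step = step
        self.best = float("inf")   # best (lowest) loss released so far

    def release(self, empirical_loss: float) -> float:
        """Return the score the agent is allowed to see for this query."""
        if empirical_loss < self.best - self.step:
            # Significant improvement: release it, rounded to the step grid.
            self.best = round(empirical_loss / self.step) * self.step
        # Otherwise repeat the previous best, leaking nothing about this query.
        return self.best


ladder = LadderRelease(step=0.1)
first = ladder.release(1.00)    # first query is always released
second = ladder.release(0.97)   # improvement below the step: previous value repeated
third = ladder.release(0.84)    # improvement above the step: new rounded value
```

Small fluctuations of the empirical loss, which are what a scalar-feedback attacker accumulates over many queries, are hidden behind the repeated value, while genuine improvements larger than the step still get through.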

### 7\.6 Boundary and metric tests

Figure [2](https://arxiv.org/html/2606.11417#S5.F2) collects the remaining assumption tests. Signed accounting is load-bearing: signed cumulative progress equals endpoint improvement $-0.072\pm 0.042$, while clipped reward accumulates $1.417\pm 0.088$. The stream/audit separation holds: stream-scored CP is $29.0$ to $30.0$ while sealed-audit CP is $0.72$ to $0.74$ for $k\in\{0,0.5\}$, and at $k=1$ the sealed audit goes slightly negative. The capacity boundary is real: direct training on a reusable panel gives apparent CP around $+2.39$ while true fresh-audit CP is negative. Calibration and discrimination decouple: log-loss and ECE can change with no change in hard accuracy, so compression progress should be read as probabilistic predictive improvement, not merely 0-1 discrimination.

## 8 Lean 4 mechanization

The formal artifact is a self-contained Lean 4 / Mathlib development. The core declarations are listed in Table [4](https://arxiv.org/html/2606.11417#S8.T4). The two structural theorems are stated for an arbitrary potential $\mathcal{E}:\iota\to\mathbb{R}$; cross-entropy and finite PMFs instantiate the entropy-floor theorem.

![Refer to caption](https://arxiv.org/html/2606.11417v1/x7.png)

Figure 7: Proof architecture. Signed audit progress is an endpoint potential drop, which keeps the mechanized theorems short. The finite-class and ARC experiments instantiate the structural result; the boundary experiments mark assumptions that cannot be dropped.

Table 4: Lean proof integration. The deterministic and information-theoretic core is mechanized. The finite-class concentration corollary uses the standard Hoeffding-plus-union-bound argument to realize the uniform-deviation premise of `finite_audit_goodhart`.

The finite-audit bound is mechanized as follows:

```lean
theorem finite_audit_goodhart {iota : Type*} {F : Set iota}
    {Ehat E : iota -> Real} {Delta : Real}
    (hUD : UniformDev F Ehat E Delta) {g : Nat -> iota}
    (hg : Admissible g F) (T : Nat) :
    cumCP Ehat g T <= cumCP E g T + 2 * Delta := by
  rw [cumCP_telescope Ehat g T, cumCP_telescope E g T]
  have h0 := abs_le.1 (hUD (g 0) (hg 0))
  have hT := abs_le.1 (hUD (g T) (hg T))
  linarith [h0.1, h0.2, hT.1, hT.2]
```

Once signed empirical CP telescopes, only the endpoints matter: the uniform-deviation event is used exactly twice, at $g(0)$ and $g(T)$. The bound is therefore horizon-free.
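The endpoint argument can also be checked numerically. The following is a minimal Python sketch, not part of the paper's artifact: the finite class, the loss values, the deviation `delta`, and the random trajectory are all hypothetical, and the only property assumed is that every empirical loss lies within `delta` of its true loss.

```python
import random


def cum_cp(loss, trajectory):
    """Sum of signed per-step progress loss[g(t-1)] - loss[g(t)]."""
    return sum(loss[a] - loss[b] for a, b in zip(trajectory, trajectory[1:]))


rng = random.Random(0)
num_models = 50
delta = 0.1  # assumed uniform deviation: |E_hat - E| <= delta for every model

# Hypothetical true audit losses, and empirical losses within delta of them.
true_loss = [rng.uniform(0.0, 2.0) for _ in range(num_models)]
emp_loss = [e + rng.uniform(-delta, delta) for e in true_loss]

# An arbitrary trajectory through the class; the bound does not depend on it.
horizon = 1000
trajectory = [rng.randrange(num_models) for _ in range(horizon + 1)]

emp_cp = cum_cp(emp_loss, trajectory)
true_cp = cum_cp(true_loss, trajectory)

# Telescoping: only the endpoints g(0) and g(T) survive the sum.
endpoint_drop = emp_loss[trajectory[0]] - emp_loss[trajectory[-1]]
```

In this sketch `emp_cp` matches `endpoint_drop` up to floating-point error and never exceeds `true_cp + 2 * delta`, whatever the horizon or the path taken through the class.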

## 9 Discussion

#### Relation to compression\-progress folklore\.

The informal compression\-progress principle rewards improvements in the compressor\. Our theorem states the conditions under which this is credible: the compressor must be evaluated by a fixed audit loss, the reward must be signed, and finite reuse must be controlled by a uniform\-deviation budget\.

#### Multiplicative weights versus the audit certificate\.

Multiplicative weights has regret and allocation guarantees, but its potential is not a predictive\-performance metric: it can reweight tasks without reducing audit loss\. The credibility certificate is therefore audit\-CP, and MWU/EXP3 is a scheduler that consumes it\.

#### Log\-loss versus hard accuracy\.

Compression is probabilistic\. Proper scoring losses reward calibrated beliefs, not only argmax decisions\. Calibration can improve at fixed hard accuracy\. For reconstruction tasks we report active\-cell accuracy for interpretability; the certified signal is audit cross\-entropy\.

## 10 Limitations

The guarantee is conditional: it assumes a sealed audit, a reachable model class of finite effective capacity, and an audit the agent cannot alter. These are the boundaries probed by the stream-scoring, panel-memorization, and holdout-reuse experiments. The deterministic and information-theoretic core is mechanized in Lean; the finite-class concentration corollary is the standard Hoeffding-plus-union-bound argument, stated but not mechanized. The empirical claims are scoped to match. Audit-CP leads the non-oracle baselines only narrowly in mean active-cell accuracy, so its robust separation is the distractor-allocation budget. The unchanged accuracy in the calibration probe is an identity of temperature rescaling, with the calibration signal carried by NLL and ECE. The scalar-feedback attack crosses threshold in the single-cell calibration, while the deployed-scale panel stays below it.

## 11 Conclusion

Compression progress becomes a Goodhart-resistant RSI signal only after a precise measurement choice: score *signed* progress on a *sealed audit* distribution. The resulting reward has an endpoint accounting identity and, for finite panels, a $2\Delta_{n}$ false-positive budget. The same budget accounts for the failure modes: prediction-error curiosity is attracted to irreducible noise; clipped progress farms forgetting/relearning cycles; stream progress optimizes the wrong distribution; high-capacity reusable audits can be overfit. Audit-CP avoids these failures exactly when the audit remains sealed and its uniform-deviation budget is finite.

The resulting design rule for intrinsic\-motivation and self\-improvement systems is: use compression progress as the audited reward, use a separate scheduler for exploration, retain negative progress, and treat the audit as a protected measurement instrument\. Under these conditions, apparent improvement is bounded by true audit improvement plus the panel’s deviation budget\.

## Ethics statement

This work studies an accounting signal for genuine capability improvement, motivated by reward design for continual learning and recursive self-improvement. The contribution is a measurement condition, not a safety guarantee. The telescoping identity and the $2\Delta_{n}$ budget certify that signed compression progress on a sealed audit cannot overstate true audit improvement by more than a finite amount. They do not certify that the audit distribution captures everything a deployer cares about, nor that an agent which satisfies the budget is safe. We state the boundary conditions explicitly, namely clipping, stream scoring, high-capacity reusable panels, and adaptive feedback, so that the signal is not read as a stronger guarantee than it is.

The adaptive scalar\-feedback attack reproduces a known adaptive\-data\-analysis vulnerability: repeated scalar feedback from a reusable panel can be overfit\. We include it to calibrate the test and to show that standard release defenses suppress it, not to introduce a new exploit\. All experiments use synthetic ARC\-TGI grid\-transformation tasks and contain no human\-subject or personal data\. The principal societal risk is overclaiming: deploying audit\-CP without a sealed audit and a finite uniform\-deviation budget could create false assurance of genuine improvement\. The boundary results are included to make that failure mode visible\.

## Reproducibility statement

The numerical claims in this paper are computed from the accompanying experiment artifact containing scripts, logs, per-seed JSON shards, summary tables, and vector figures. The formal claims in Table [4](https://arxiv.org/html/2606.11417#S8.T4) are represented by the accompanying Lean 4 artifact with a pinned Mathlib dependency and an axiom-audit target. The public source package for this paper uses only these two artifacts as inputs for results and proof declarations.

## Appendix A Additional theorem details

### A\.1 Finite\-class concentration

Let $X_{i}^{\theta}=\ell(\theta,z_{i})$ for $z_{i}\sim\mathsf{Q}$, with $X_{i}^{\theta}\in[0,R]$. For a fixed $\theta$, Hoeffding gives

$$\Pr\left(|\widehat{\mathcal{E}}_{n}(\theta)-\mathcal{E}(\theta)|>\epsilon\right)\leq 2\exp\left(-\frac{2n\epsilon^{2}}{R^{2}}\right). \tag{17}$$

Union bounding over $N$ experts gives

$$\Pr\left(\sup_{\theta\in\mathcal{F}}|\widehat{\mathcal{E}}_{n}(\theta)-\mathcal{E}(\theta)|>\epsilon\right)\leq 2N\exp\left(-\frac{2n\epsilon^{2}}{R^{2}}\right). \tag{18}$$

Setting the right-hand side to $\delta$ yields Corollary [1](https://arxiv.org/html/2606.11417#Thmcorollary1). Combining with Theorem [2](https://arxiv.org/html/2606.11417#Thmtheorem2) gives the finite-expert false-positive budget.
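Solving $2N\exp(-2n\epsilon^{2}/R^{2})=\delta$ for $\epsilon$ gives $\epsilon=R\sqrt{\log(2N/\delta)/(2n)}$. The Python sketch below evaluates this closed form; the function names and the example values of $n$, $N$, and $\delta$ are illustrative assumptions and are not the paper's experimental settings.

```python
import math


def hoeffding_union_deviation(n, num_experts, delta, loss_range=1.0):
    """Solve 2 * N * exp(-2 * n * eps**2 / R**2) = delta for eps."""
    return loss_range * math.sqrt(math.log(2.0 * num_experts / delta) / (2.0 * n))


def failure_probability(n, num_experts, eps, loss_range=1.0):
    """Right-hand side of the union bound for a given eps."""
    return 2.0 * num_experts * math.exp(-2.0 * n * eps ** 2 / loss_range ** 2)


# Illustrative values only.
eps = hoeffding_union_deviation(n=200, num_experts=64, delta=0.05)
false_positive_budget = 2.0 * eps
```

Under these assumptions the deviation shrinks as $n^{-1/2}$, so quadrupling the panel size halves the budget, while the dependence on $N$ and $1/\delta$ is only logarithmic.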

### A\.2 Covering\-number extension

For a bounded Lipschitz loss family $\ell_{\theta}$ and an $\epsilon$-cover of size $N(\epsilon,\mathcal{F},d)$ in a metric that controls loss uniformly, the same argument yields

$$\Delta_{n}(\mathcal{F})\lesssim\epsilon+R\sqrt{\frac{\log N(\epsilon,\mathcal{F},d)+\log(1/\delta)}{n}}. \tag{19}$$

For bounded linear classes in $d$ dimensions, $\log N(\epsilon)$ scales as $d\log(B/\epsilon)$ under standard norm constraints. For bounded RKHS balls, the analogous statement is controlled by the corresponding metric entropy or effective dimension. The paper's claims do not require a novel concentration theorem; they require recognizing that any such theorem supplies the exact same false-positive budget in Theorem [2](https://arxiv.org/html/2606.11417#Thmtheorem2).

### A\.3 Boundary constructions

###### Proposition 1 (Clipped-cycle construction).

Let $a,b\in\Theta$ with $\mathcal{E}(a)<\mathcal{E}(b)$, and define $\theta_{t}=a$ for even $t$ and $\theta_{t}=b$ for odd $t$. Then signed cumulative progress over every even horizon is zero, but clipped cumulative progress grows linearly in $T$.

###### Proof.

Each two-step cycle has signed reward $(\mathcal{E}(a)-\mathcal{E}(b))+(\mathcal{E}(b)-\mathcal{E}(a))=0$. The clipped reward for the $a\to b$ step is $0$ and for the $b\to a$ step is $\mathcal{E}(b)-\mathcal{E}(a)>0$. Thus every two-step cycle pays positive clipped reward at zero net endpoint improvement. ∎
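The construction is easy to simulate. The sketch below is an illustration under assumed loss values ($\mathcal{E}(a)=0.25$, $\mathcal{E}(b)=0.75$, chosen only so the arithmetic is exact in floating point); the function name is hypothetical.

```python
def signed_and_clipped(loss_a, loss_b, horizon):
    """Cumulative signed and clipped progress for theta_t = a (t even), b (t odd)."""
    losses = [loss_a if t % 2 == 0 else loss_b for t in range(horizon + 1)]
    steps = [prev - cur for prev, cur in zip(losses, losses[1:])]
    signed = sum(steps)
    clipped = sum(max(step, 0.0) for step in steps)  # negative progress discarded
    return signed, clipped


signed_100, clipped_100 = signed_and_clipped(loss_a=0.25, loss_b=0.75, horizon=100)
signed_200, clipped_200 = signed_and_clipped(loss_a=0.25, loss_b=0.75, horizon=200)
```

With these values the signed total stays at zero on both even horizons, while the clipped total doubles when the horizon doubles, since each cycle contributes $\mathcal{E}(b)-\mathcal{E}(a)=0.5$.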

###### Proposition 2 (Stream-audit separation).

There exist losses $\mathcal{E}_{\mathsf{P}_{t}}$ and $\mathcal{E}_{\mathsf{Q}}$ and a trajectory $\theta_{t}$ such that $\mathcal{E}_{\mathsf{P}_{t}}(\theta_{t-1})-\mathcal{E}_{\mathsf{P}_{t}}(\theta_{t})>0$ for all $t$ while $\mathcal{E}_{\mathsf{Q}}(\theta_{T})\geq\mathcal{E}_{\mathsf{Q}}(\theta_{0})$.

###### Proof.

Take a two-coordinate linear prediction problem. Let the audit loss depend only on coordinate 1 and the stream loss at time $t$ depend only on coordinate 2. A policy that improves coordinate 2 while degrading coordinate 1 has positive stream progress and non-improving audit loss. The stream-versus-audit comparison instantiates this separation in the ARC-style setting. ∎
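A concrete instance of this separation can be sketched as follows. The quadratic losses, the shared target value of 1, and the update rule are assumptions chosen for illustration, and the stream loss is taken to be the same at every time step, which is a special case of the time-varying $\mathcal{E}_{\mathsf{P}_{t}}$ in the proposition.

```python
def audit_loss(theta):
    """Sealed-audit loss: depends only on coordinate 1."""
    return (theta[0] - 1.0) ** 2


def stream_loss(theta):
    """Stream loss: depends only on coordinate 2."""
    return (theta[1] - 1.0) ** 2


theta = (0.0, 0.0)
trajectory = [theta]
for _ in range(20):
    # Move coordinate 2 halfway to its target; push coordinate 1 away from its target.
    theta = (theta[0] - 0.05, theta[1] + 0.5 * (1.0 - theta[1]))
    trajectory.append(theta)

stream_steps = [stream_loss(a) - stream_loss(b)
                for a, b in zip(trajectory, trajectory[1:])]
audit_cp = audit_loss(trajectory[0]) - audit_loss(trajectory[-1])
```

Every entry of `stream_steps` is positive, yet `audit_cp` is negative: a reward scored on the stream would pay at every step while the sealed audit records a net loss.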

###### Proposition 3 (Reusable-panel memorization).

If a model class can interpolate arbitrary labels on the finite audit panel while assigning poor probabilities off\-panel, then empirical audit loss can be driven to zero while population audit loss remains high\.

###### Proof\.

Choose a model that memorizes each panel point and behaves as a bad predictor elsewhere\. Since the panel has measure zero under a continuous population distribution, empirical loss can vanish while population loss is governed by off\-panel behavior\. In discrete settings, the same construction holds when the panel is a small subset of the support and the predictor is bad on the complement\. The panel\-memorization test and the holdout\-reuse attack probe the white\-box and scalar\-feedback forms of this phenomenon\. ∎
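The white-box form of this construction can be illustrated with a lookup-table predictor. The sketch below is a toy under stated assumptions, not the paper's panel-memorization test: inputs are uniform on $[0,1)$, labels are fair coin flips independent of the input, and the off-panel probability `off_panel=0.99` is an arbitrary choice of bad predictor.

```python
import math
import random

rng = random.Random(0)

# Hypothetical panel: continuous inputs, labels independent of the input.
panel = [(rng.random(), rng.randrange(2)) for _ in range(200)]
table = dict(panel)  # memorized input -> label


def predict_prob_one(x, eps=1e-6, off_panel=0.99):
    """Memorize panel points; predict a confident, uninformed constant elsewhere."""
    if x in table:
        return 1.0 - eps if table[x] == 1 else eps
    return off_panel


def log_loss(p_one, y):
    return -math.log(p_one if y == 1 else 1.0 - p_one)


def mean_loss(data):
    return sum(log_loss(predict_prob_one(x), y) for x, y in data) / len(data)


panel_loss = mean_loss(panel)
fresh = [(rng.random(), rng.randrange(2)) for _ in range(5000)]
fresh_loss = mean_loss(fresh)
```

Here `panel_loss` is close to zero while `fresh_loss` sits well above $\log 2$, the loss of the uninformed predictor that always outputs probability $0.5$.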

## Appendix B Lean declaration excerpts

### B\.1 Telescoping and finite budget

```lean
theorem cumCP_telescope {iota : Type*} (E : iota -> Real)
    (g : Nat -> iota) (T : Nat) :
    cumCP E g T = E (g 0) - E (g T) := by
  induction T with
  | zero => simp [cumCP]
  | succ n ih =>
    simp only [cumCP, Finset.sum_range_succ] at ih |-
    rw [ih]; ring

theorem cumCP_le_of_lb {iota : Type*} (E : iota -> Real)
    (g : Nat -> iota) (T : Nat) {Emin : Real}
    (hlb : Emin <= E (g T)) :
    cumCP E g T <= E (g 0) - Emin := by
  rw [cumCP_telescope]; linarith
```

### B\.2 Entropy floor

```lean
theorem klDivFinitePMF_nonneg {H : Type*} [Fintype H]
    (Q P : FinitePMF H) [HasPositivePrior P] :
    0 <= klDivFinitePMF Q P := by
  ...

-- T3 as stated in the paper: the CONDITIONAL (input-averaged) entropy floor. The floor
-- H(Y|X) <= E_x H(Q, P (g T)) is DISCHARGED by conditional Gibbs
-- (condEntropy_le_condCrossEntropy), not taken as a hypothesis. The unconditional
-- entropy_floor_budget_crossEntropy is the single-input special case.
theorem cond_entropy_floor_budget_crossEntropy {iota X H : Type*} [Fintype X] [Fintype H]
    (mu : FinitePMF X) (Q : X -> FinitePMF H) (P : iota -> X -> FinitePMF H)
    [forall t, forall x, HasPositivePrior (P t x)] (g : Nat -> iota) (T : Nat) :
    cumCP (fun t => condCrossEntropyFinitePMF mu Q (P t)) g T
      <= condCrossEntropyFinitePMF mu Q (P (g 0)) - condEntropyFinitePMF mu Q :=
  cumCP_le_of_lb _ g T (condEntropy_le_condCrossEntropy mu Q (P (g T)))
```

## Appendix C Experiment protocol summary

#### Reward\-signal ablation\.

The task source consists of 24 learnable ARC\-TGI grid\-to\-grid generator families and 8 i\.i\.d\. uniform distractor families\. The model is trained for 5000 steps\. The scheduler is held fixed across reward variants\. Reward variants are signed audit\-CP, prediction error, RND, ICM, uniform, round\-robin, and learnable\-only oracle\. Metrics are active\-cell audit accuracy, floored audit cross\-entropy, and distractor allocation fraction\.

#### Finite\-audit concentration\.

Sample finite families of predictors and audit panels at several $n$. Compute empirical uniform deviation between panel loss and a larger fresh-audit estimate. Fit a log-log slope of deviation against $n$.
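The slope fit in the last step is an ordinary least-squares regression of $\log(\text{deviation})$ on $\log n$. A minimal sketch follows; the function name is hypothetical, and the deviations are synthetic placeholders generated from an exact $n^{-1/2}$ law, not the paper's measurements.

```python
import math


def loglog_slope(panel_sizes, deviations):
    """Least-squares slope of log(deviation) against log(n)."""
    xs = [math.log(n) for n in panel_sizes]
    ys = [math.log(d) for d in deviations]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    return covariance / variance


# Synthetic placeholder data following deviation = 1.5 * n ** -0.5 exactly.
panel_sizes = [50, 100, 200, 400, 800, 1600]
synthetic_deviations = [1.5 / math.sqrt(n) for n in panel_sizes]
slope = loglog_slope(panel_sizes, synthetic_deviations)
```

On data that follow the Hoeffding rate exactly, the fitted slope is $-0.5$ up to rounding; measured deviations would be substituted for `synthetic_deviations` to estimate the empirical exponent.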

#### Scalar\-feedback attack\.

An adaptive attacker receives scalar audit feedback from a reusable panel and selects candidate updates. We compare naive reusable release against fresh subsampling, ladder release, rounded release, and one-shot release. The reported gap is reusable CP minus fresh CP; success is crossing $2\Delta_{n}$.
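The mechanism behind this attack can be seen in a deliberately simplified toy, which is not the paper's protocol. The assumptions are: binary labels that are fair coin flips (so no predictor has real skill), panel accuracy in place of audit cross-entropy, one query per panel cell, and rounding to one decimal place as the only defense shown. All names are hypothetical.

```python
import random


def panel_accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def single_cell_attack(labels, release):
    """Flip one prediction per query; keep the flip only if the released score rises."""
    preds = [0] * len(labels)
    best = release(panel_accuracy(preds, labels))
    for i in range(len(labels)):
        preds[i] ^= 1
        score = release(panel_accuracy(preds, labels))
        if score > best:
            best = score
        else:
            preds[i] ^= 1  # revert the flip
    return preds


rng = random.Random(0)
n = 200
panel_labels = [rng.randrange(2) for _ in range(n)]
fresh_labels = [rng.randrange(2) for _ in range(n)]  # independent redraw


def overfit_gap(preds):
    """Reusable-panel score minus fresh-panel score."""
    return panel_accuracy(preds, panel_labels) - panel_accuracy(preds, fresh_labels)


naive_preds = single_cell_attack(panel_labels, release=lambda s: s)
rounded_preds = single_cell_attack(panel_labels, release=lambda s: round(s, 1))
```

With exact release, every helpful flip is visible and the attacker reconstructs the panel labels, so the panel score reaches 1 while the fresh score stays near chance. With rounded release, a single-cell change of $1/n$ almost never moves the released score, so at most one flip is ever accepted and the gap stays at sampling-noise level.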

#### Boundary tests\.

The stream\-versus\-audit comparison contrasts stream\-scored CP with sealed\-audit CP under increasing stream bias\. The clipping comparison cycles between states to compare signed and clipped progress\. The panel\-memorization test trains directly on a reusable panel to probe high\-capacity memorization\. The calibration probe applies temperature scaling and checks that hard accuracy is invariant while NLL/ECE move\.
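The calibration probe rests on the fact that dividing logits by a positive temperature preserves their ordering, so the argmax and hence hard accuracy cannot change, while the predicted probabilities and the negative log-likelihood do. A small sketch with hypothetical three-class logits (one example confidently wrong):

```python
import math


def softmax(logits, temperature):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def accuracy_and_nll(batch, temperature):
    correct = 0
    nll = 0.0
    for logits, label in batch:
        probs = softmax(logits, temperature)
        correct += int(probs.index(max(probs)) == label)
        nll -= math.log(probs[label])
    return correct / len(batch), nll / len(batch)


# Hypothetical (logits, true label) pairs; the third is confidently wrong.
batch = [
    ([4.0, 0.0, 0.0], 0),
    ([0.0, 5.0, 1.0], 1),
    ([6.0, 0.0, 1.0], 2),
    ([0.0, 1.0, 3.0], 2),
]

acc_sharp, nll_sharp = accuracy_and_nll(batch, temperature=1.0)
acc_soft, nll_soft = accuracy_and_nll(batch, temperature=2.0)
```

Accuracy is 0.75 at both temperatures, but the softer temperature lowers the mean NLL because it reduces the penalty on the overconfident error. A hard-accuracy metric sees no change, whereas a proper scoring loss registers the improvement.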

## References

- [1] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. *SIAM Journal on Computing*, 32(1):48–77, 2002.
- [2] Avrim Blum and Moritz Hardt. The Ladder: A reliable leaderboard for machine learning competitions. arXiv:1502.04585, 2015.
- [3] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv:1810.12894, 2018.
- [4] François Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. ARC Prize 2025: Technical report. arXiv:2601.10904, 2026.
- [5] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. arXiv:1506.02629, 2015.
- [6] Jens Lehmann, Syeda Khushbakht, Nikoo Salehfard, Nur A. Zarin Nishat, Dhananjay Bhandiwad, Andrei Aioanei, and Sahar Vahdati. ARC-TGI: Human-validated task generators with reasoning chain templates for ARC-AGI. arXiv:2603.05099, 2026.
- [7] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. arXiv:1705.05363, 2017.
- [8] Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In *From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior*, MIT Press/Bradford Books, 1991.
- [9] Jürgen Schmidhuber. Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes. arXiv:0812.4360, 2008.
- [10] Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). *IEEE Transactions on Autonomous Mental Development*, 2(3):230–247, 2010.
- [11] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv:1606.06565, 2016.
- [12] Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. *Journal of the American Statistical Association*, 102(477):359–378, 2007.
- [13] Joel Lehman, Jeff Clune, Dusan Misevic, and others. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. *Artificial Life*, 26(2):274–306, 2020.
- [14] David Manheim and Scott Garrabrant. Categorizing variants of Goodhart's law. arXiv:1803.04585, 2018.
- [15] Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models. In *International Conference on Learning Representations (ICLR)*, 2022.
- [16] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In *International Conference on Machine Learning (ICML)*, 2019.
- [17] Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2019.
- [18] Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward hacking. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2022.
