# Detecting and Suppressing Reward Hacking with Gradient Fingerprints
Source: https://arxiv.org/html/2604.16242
Songtao Wang^A, Quang Hieu Pham^A, Fangcong Yin^N, Xinpeng Wang^L, Jocelyn Qiaochu Chen^A, Greg Durrett^N, Xi Ye^A
^A University of Alberta, ^N New York University, ^L LMU Munich, ^P Princeton Language and Intelligence
{songtao2, xi.ye}@ualberta.ca
## Abstract
Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes (e.g., spurious patterns in training data) in the reward function to achieve high scores without solving the intended task. These reward-hacking behaviors are often *implicit*, as the intermediate chain-of-thought (CoT) may appear plausible on the surface, limiting the effectiveness of purely text-based monitoring. We propose **Gradient Fingerprint (Grift)**, a method for detecting reward hacking using models' internal computations. Given a prompt and a model-generated CoT, Grift computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward hacking behavior. Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, Grift substantially outperforms strong baselines, including CoT Monitor and TRACE, achieving over 25% relative improvement in detecting reward hacking behavior. Moreover, integrating Grift into the rejection fine-tuning pipeline for reasoning tasks reduces reward hacking and improves performance on the true task objective. Our results highlight a promising direction of leveraging gradient-level representations for assessing the quality of CoT reasoning traces. Our code is available at: https://github.com/songtao-x/reward_hack.
## 1 Introduction
Reinforcement learning with verifiable rewards (RLVR) has become a popular paradigm for improving reasoning capabilities in language models (LMs) (Shao et al., 2024; OpenAI, 2024; Guo et al., 2025; Yu et al., 2025). In RLVR, LMs are trained to maximize outcome-level rewards, such as whether the final answer passes a verifier or test suite, without supervision over the intermediate reasoning process. While this approach scales well to tasks with automatic evaluation, the lack of process-level supervision introduces a fundamental vulnerability: models may learn strategies that achieve high reward without faithfully solving the intended task (Skalse et al., 2022; Baker et al., 2025). This phenomenon is also known as *reward hacking*, where a model exploits imperfections in the reward function or reasoning shortcuts (Gupta et al., 2025a; Baker et al., 2025) rather than performing the intended reasoning. Such exploits can arise from various sources, including prompt artifacts, in-context hints, or flaws in automated verifiers (Feng et al., 2025; Denison et al., 2024a).
For example, coding agents have been observed to exploit dataset leakage in coding benchmarks (MacDiarmid et al., 2025; Deshpande et al., 2026) by accessing future commits that contain the solution (Kahn, 2025). While such cases are sometimes visible in the model's reasoning trace, there is growing concern about reward hacking that is *implicit* and harder to detect (Chen et al., 2025; Arcuschini et al., 2025; Wang et al., 2026): as illustrated in Figure 1, a model may leverage a hint while producing a seemingly plausible chain-of-thought (CoT) explanation to hide the exploit (Lindsey et al., 2025). Popular text-based monitoring approaches (Baker et al., 2025; Emmons et al., 2025) thus become insufficient, since the surface reasoning trace may not faithfully reflect the model's internal decision process.

In this work, we introduce **Gradient Fingerprint (Grift)**, a novel method for detecting reward hacking by analyzing the model's internal computations rather than its generated text. The key idea of Grift is to extract gradient-based representations for a reasoning trajectory. Given a prompt and the CoT generated by a model, Grift encodes the CoT into a compact vector representation (called fingerprint), derived from gradients of the CoT conditioned on the prompt. We efficiently compute these fingerprints (representations) using lightweight adapters (Hu et al., 2022) on selected layers, then compress them via random projection. Intuitively, each fingerprint characterizes the direction in parameter space that a reasoning trace induces, providing a compact summary of the model's internal computation for that trace.
These gradient fingerprints enable accurate detection of reward hacking. As shown in Figure 1, Grift takes prompt–CoT pairs from a model (either a trained model or intermediate checkpoints during training) and assigns a score that is higher for non-hacking behavior (where the model achieves high reward without exploiting loopholes) and lower for hacking behavior. To obtain such a score, we cluster the gradient fingerprints and label clusters as reward-hacking or non-hacking by inspecting a small set of examples. The final score is then defined by the relative distance to the non-hacking cluster.
On multiple reasoning tasks spanning math, code, and logical reasoning, the Grift score substantially outperforms strong baselines, including CoT Monitor (Baker et al., 2025) and TRACE (Wang et al., 2026), achieving over 25% relative improvement in reward hacking detection. Unlike past work that primarily focuses on detection (Baker et al., 2025; Wang et al., 2026), we show that Grift can be incorporated into training as an additional supervision signal for the reasoning process. When used to guide sample selection in rejection fine-tuning (Dong et al., 2023), Grift effectively suppresses reward hacking and improves true task performance. Notably, it narrows the performance gap between models trained with access to reward exploits and those trained in an oracle environment where such exploits are unavailable, making models more robust to noisy training data that contains hackable features.
To summarize, our contributions are as follows:
1. We propose a novel gradient-based method for detecting reward hacking in RLVR.
2. We showcase a practical training pipeline that uses our method to suppress reward hacking.
3. We provide insights on using gradient-level representations as a reliable signal for assessing the quality of reasoning traces.
## 2 Preliminaries: Implicit Reward Hacking
Reward hacking occurs when a policy trained to maximize a proxy reward $\hat{R}$ learns to exploit unintended loopholes in $\hat{R}$, rather than solving the underlying task as measured by the true (often unavailable) reward $R$ (Skalse et al., 2022; Wang et al., 2026). This leads to a discrepancy between proxy performance during training and true task performance at deployment. As a result, models may fail once such loopholes are removed, or exhibit significant degradation on harder reasoning tasks (Denison et al., 2024b). Figure 3 illustrates this phenomenon in training dynamics of two reasoning tasks: while training accuracy drastically increases, test accuracy (unavailable during training) stagnates or fluctuates.

*Figure 3: (a) BigMath train-test dynamics; (b) AR-LSAT train-test dynamics.*

### Sources of Reward Hacking
Reward hacking can arise from multiple common sources:
- **Reward-model or verifier loopholes.** The proxy reward $\hat{R}$ itself may be flawed. Automated verifiers may accept spurious outputs, incomplete solutions, or surface patterns correlated with correctness that do not reflect genuine task completion (Ouyang et al., 2022; Baker et al., 2025).
- **In-context loopholes.** The training data may contain unintended hints or artifacts that reveal the answer or simplify the task in ways not anticipated by the dataset curators. Examples include prompts that leak the correct answer through identifiers or contextual hints (Emmons et al., 2025); see Figure 1 for an example in a simulated environment.
- **Finite-answer-space loopholes.** Reward hacking can also arise naturally in tasks with a small output space, such as multiple-choice question answering or true/false verification. In these settings, a model may obtain rewards by chance, without performing the intended reasoning process. An example can be found in Figure 2.
The above sources can lead to either explicit or *implicit* reward hacking. In explicit cases, the model directly verbalizes the exploit in its CoT (Turpin et al., 2025), making the failure potentially detectable by inspecting the reasoning trace. In contrast, *implicit reward hacking* occurs when the model exploits a shortcut while producing a plausible CoT that conceals the exploit (Roger and Greenblatt, 2023; Pfau et al., 2024) (see Figure 1 for an example). This makes detection substantially harder, since the surface reasoning trace may appear correct even when the underlying computation relies on a loophole.
Given the challenges above, we focus on *implicit* reward hacking arising from two settings: *in-context loopholes*, which are commonly studied in prior work, and *finite-answer-space loopholes*, a natural setting introduced in this work.
## 3 Gradient Fingerprint

To detect implicit reward hacking, we hypothesize that models exhibit systematically different internal computations when exploiting a loophole versus performing non-hacking behavior, and these behaviors induce distinct gradient patterns. Prior work has shown that gradients can capture subtle, implicit properties of text—such as diversity (Jung et al., 2025) and safety (Xie et al., 2024; Hu et al., 2024)—suggesting that gradients provide a sensitive probe of underlying computational differences beyond surface-level outputs.
Building on this intuition, our method proceeds in two stages (Figure 4):
1. For each prompt–response pair $(x,y)$, we compute a *gradient fingerprint* $\mathcal{F}(x,y,\theta)$, a compact vector derived from the model's gradient that captures how the model internally processes that response; and
2. We cluster these fingerprints to produce a score $\mathcal{S}$ indicating the likelihood of reward hacking for the given $(x,y)$.
We describe these two procedures in §3.1 and §3.2, respectively.
### 3.1 Constructing Gradient Fingerprints
Let $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^N$ denote a dataset of $N$ prompt–response pairs collected from a model checkpoint (e.g., at any stage of RLVR training). Let $\theta$ denote the parameters of this model, which has $L$ transformer layers. We define the language modeling loss of a response $y$ conditioned on prompt $x$ as:
$$L(y|x;\theta) = -\sum_{t=1}^{|y|} \log p_\theta(y_t | x, y_{<t})$$
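As a concrete illustration, this conditional loss can be computed with a standard Hugging Face causal LM by masking out the prompt tokens. This is a minimal sketch: the base checkpoint and the handling of tokenization at the prompt–response boundary are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative base checkpoint; the paper does not pin a specific model here.
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def response_nll(lm, prompt: str, response: str) -> torch.Tensor:
    """L(y|x; theta): summed negative log-likelihood of the response given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # prompt tokens do not contribute to the loss
    out = lm(input_ids=full_ids, labels=labels)
    # `out.loss` is the mean NLL over response tokens; rescale to the sum in the equation above.
    return out.loss * (labels != -100).sum()
```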
To compute gradient fingerprints efficiently, we insert Low-Rank Adaptation (LoRA) (Hu et al., 2022) adapters into selected layers of the model. For each layer $\ell$, we compute the gradient with respect to the LoRA parameters:
$$g_\ell = \frac{\partial L(y|x;\theta)}{\partial \phi_\ell}$$
where $\phi_\ell$ denotes the LoRA parameters at layer $\ell$.
The key intuition is that these gradients encode information about how the model's computation at each layer contributes to the response likelihood. By using LoRA adapters rather than computing gradients with respect to all model parameters, we make this computation efficient and focus on the most salient directions.
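A sketch of this step with the `peft` library is shown below; the LoRA rank, target modules, and layer selection are illustrative assumptions rather than the configuration used in the paper.

```python
import torch
from peft import LoraConfig, get_peft_model

# Small adapters on attention projections; rank and target modules are illustrative choices.
lora_config = LoraConfig(r=8, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
peft_model = get_peft_model(model, lora_config)

def lora_gradients(prompt: str, response: str) -> dict[str, torch.Tensor]:
    """g_l: gradient of L(y|x; theta) with respect to each adapted layer's LoRA parameters."""
    peft_model.zero_grad()
    loss = response_nll(peft_model, prompt, response)  # reuses the loss sketch above
    loss.backward()
    grads = {}
    for name, param in peft_model.named_parameters():
        # After get_peft_model, only the LoRA adapter parameters are trainable.
        if param.requires_grad and param.grad is not None:
            grads[name] = param.grad.detach().flatten().clone()
    return grads
```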
We then apply random projection to compress these gradients into a compact representation:
$$\mathcal{F}(x,y,\theta) = \text{RandomProject}([g_1, g_2, \ldots, g_L])$$
This compression step ensures that the fingerprint is compact and can be efficiently clustered.
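The compression step can be sketched as follows. For memory reasons the sketch projects each adapter tensor separately and concatenates the results, a simplification of projecting the full concatenated gradient at once; the output dimension and the fixed seed are assumptions.

```python
import torch

def gradient_fingerprint(grads: dict[str, torch.Tensor],
                         dim_per_tensor: int = 128, seed: int = 0) -> torch.Tensor:
    """Compress the per-layer gradients [g_1, ..., g_L] into a compact fingerprint."""
    gen = torch.Generator().manual_seed(seed)  # fixed seed => the same projection for every example
    pieces = []
    for name in sorted(grads):                 # fixed ordering keeps fingerprints comparable
        g = grads[name]
        proj = torch.randn(g.numel(), dim_per_tensor, generator=gen) / (dim_per_tensor ** 0.5)
        pieces.append(g @ proj)
    return torch.cat(pieces)
```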
### 3.2 Clustering and Labeling
Once we have computed gradient fingerprints for all prompt–response pairs in $\mathcal{D}$, we cluster them using K-Means. Let $K$ denote the number of clusters (we use $K=2$ as a default, corresponding to reward-hacking and non-hacking behavior).
For each cluster, we manually inspect a small set of representative samples to determine whether that cluster corresponds to reward-hacking or non-hacking behavior. This labeling step is necessary because clusters do not have inherent semantics—we must determine which cluster represents which behavior.
Once clusters are labeled, we can assign a score to each prompt–response pair based on its distance to the non-hacking cluster:
$$\mathcal{S}(x,y) = 1 - \frac{d(x,y, C_{\text{non-hack}})}{d(x,y, C_{\text{hack}})}$$
where $d(x,y, C)$ denotes the distance from $(x,y)$ to cluster $C$, and $C_{\text{non-hack}}$ and $C_{\text{hack}}$ denote the non-hacking and hacking clusters respectively.
A higher score indicates that the response more likely reflects non-hacking behavior, whereas a lower score indicates likely reward-hacking behavior.
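Putting the clustering and scoring together, a minimal sketch with scikit-learn follows; the use of Euclidean distances to K-Means centroids is an assumption, and `hack_cluster` denotes whichever cluster was labeled as reward-hacking after manual inspection.

```python
import numpy as np
from sklearn.cluster import KMeans

def grift_scores(fingerprints: np.ndarray, hack_cluster: int) -> np.ndarray:
    """Score each fingerprint; higher values indicate more likely non-hacking behavior.

    `fingerprints` is an (N, D) array of gradient fingerprints; `hack_cluster` (0 or 1)
    is the cluster labeled as reward-hacking by inspecting a few representative samples.
    """
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(fingerprints)
    dists = kmeans.transform(fingerprints)  # (N, 2) distances to the two centroids
    d_non_hack = dists[:, 1 - hack_cluster]
    d_hack = dists[:, hack_cluster]
    return 1.0 - d_non_hack / d_hack        # S(x, y) from the equation above
```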
## Appendix B Experiment Details
### Reward Hacked Response Judgment
As discussed in Section 4.1, we adopt counterfactual test detection for the in-context loophole settings and LLM-as-a-Judge for AR-LSAT. The prompt template is provided in Table 6.
**Prompt template for detecting reasonable responses for AR-LSAT**
You will be given an analytic logic problem (including context, options, and a question) together with a model response containing its thinking steps. Your task is to determine whether the thinking steps are reasonable, then provide a short comment and output 1 or 0.
A response is reasonable if it satisfies all of the following:
1. **Fidelity to the prompt:** uses only the stated rules and constraints; introduces no invented facts.
2. **Correct rule application:** applies each constraint soundly, including conditionals and contrapositives where appropriate.
3. **Consistency:** contains no contradictions across steps; assignments remain internally consistent.
4. **Traceability:** each conclusion is explicitly justified by prior steps or stated constraints; no unexplained jumps.
5. **Coverage:** considers all necessary cases, or correctly prunes them; add...