The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning

arXiv cs.AI 06/11/26, 04:00 AM Papers
Summary
This paper proposes a self-supervised reinforcement learning framework that uses consistency verifiers—reward functions checking geometric and semantic consistency under transformations—to improve spatial reasoning in large reasoning models without requiring ground-truth annotations. The method approaches the accuracy of supervised fine-tuning and generalizes across diverse tasks.
arXiv:2606.11918v1 Announce Type: new Abstract: Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled spatial data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial reasoning capabilities are already present in pre-trained LRMs but require alignment through logical coherence under geometric 2D and 3D constraints. In this work, we propose a self-supervised reinforcement learning (RL) framework that targets the internal reasoning process without requiring ground-truth annotations. By formalizing the notion of consistency verifiers -- reward functions that check for geometric and semantic consistency under transformations -- we demonstrate that models can improve their spatial reasoning abilities. We use both image transformations, like flipping, and textual transformations, like swapping the order of objects in the question, and propose a new optimal transport-based RL strategy, OT-GRPO, which is a minimal-matching variant of group relative policy optimization tailored to pairwise verifiers. We show that this label-free consistency training approaches the accuracy of models trained with ground-truth supervision and achieves similar generalization across diverse tasks and data domains.
Cached at: 06/11/26, 01:49 PM
# The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning
Source: [https://arxiv.org/html/2606.11918](https://arxiv.org/html/2606.11918)
###### Abstract

Current Large Reasoning Models \(LRMs\) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks\. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine\-tuning \(SFT\) to ingest labeled spatial data from external vision sources or synthetic engines\. In contrast, we argue that for many tasks, spatial reasoning capabilities are already present in pre\-trained LRMs but require alignment through logical coherence under geometric 2D and 3D constraints\. In this work, we propose a self\-supervised reinforcement learning \(RL\) framework that targets the internal reasoning process without requiring ground\-truth annotations\. By formalizing the notion of consistency verifiers — reward functions that check for geometric and semantic consistency under transformations — we demonstrate that models can improve their spatial reasoning abilities\. We use both image transformations, like flipping, and textual transformations, like swapping the order of objects in the question, and propose a new optimal transport\-based RL strategy, OT\-GRPO, which is a minimal\-matching variant of group relative policy optimization tailored to pairwise verifiers\. We show that this label\-free consistency training approaches the accuracy of models trained with ground\-truth supervision and achieves similar generalization across diverse tasks and data domains\.

Machine Learning, ICML

## 1 Introduction

Spatial reasoning remains a weakness for large reasoning models (LRMs). Recent benchmarks quantify the gap: models underperform humans by 30–40% on tasks such as relative position, depth ordering, and size comparison (Stogiannidis et al., [2025](https://arxiv.org/html/2606.11918#bib.bib25); Yu et al., [2025](https://arxiv.org/html/2606.11918#bib.bib32); Cai et al., [2025b](https://arxiv.org/html/2606.11918#bib.bib6); Chen et al., [2025a](https://arxiv.org/html/2606.11918#bib.bib9)), and achieve only near-random performance on harder compositional tasks (Thrush et al., [2022](https://arxiv.org/html/2606.11918#bib.bib26); Jiang et al., [2025](https://arxiv.org/html/2606.11918#bib.bib20)). Failures manifest as high answer variance, sensitivity to question phrasing (Zhang et al., [2024](https://arxiv.org/html/2606.11918#bib.bib34)), and systematic violations of geometric laws: for example, a model may answer that object A is to the left of B, and also that B is to the left of A. Reliable spatial reasoning matters for downstream 3D applications in robotics, navigation, and embodied AI, where such inconsistencies are unacceptable.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x1.png)

Figure 1: Example of a consistency verifier. Given a prompt asking whether object A is left of object B, we apply transformations (horizontal flip on the image, reformulation of the question) to create an augmented prompt. The consistency verifier checks whether the model's answers satisfy the expected relationship (here, disagreement) without requiring ground-truth labels. For instance, if the model answers True on the original and False on the augmented prompt, then, given the transformation, the answers are consistent regardless of the actual spatial arrangement.

Existing approaches treat this gap as a factual knowledge deficit. One line of work trains on fully synthetic scenes rendered from 3D simulators (Chen et al., [2024b](https://arxiv.org/html/2606.11918#bib.bib8)). Another annotates real images using cascades of vision models (e.g., depth estimators, camera calibrators, open-vocabulary detectors) to generate spatial QA pairs, then trains with supervised fine-tuning (Chen et al., [2024a](https://arxiv.org/html/2606.11918#bib.bib7); Cai et al., [2025a](https://arxiv.org/html/2606.11918#bib.bib5); Cheng et al., [2024](https://arxiv.org/html/2606.11918#bib.bib11); Ma et al., [2025](https://arxiv.org/html/2606.11918#bib.bib21)). Both approaches aim to compensate for the lack of spatial knowledge by ingesting labeled data from external sources; that is, they prescribe further training with additional factual spatial supervision.

We take a different view: the necessary spatial capabilities likely exist in pre-trained LRMs, but the model's reasoning process lacks internal coherence and adherence to fundamental geometric principles. Prior work has shown that sampling multiple reasoning paths and selecting the most consistent answer improves accuracy (Wang et al., [2022](https://arxiv.org/html/2606.11918#bib.bib29)), and that inconsistency under superficial prompt changes signals unreliable answers (Zhang et al., [2024](https://arxiv.org/html/2606.11918#bib.bib34); Dagan et al., [2025](https://arxiv.org/html/2606.11918#bib.bib12)). Spatial reasoning has additional structure that makes consistency appealing. Consider the question "Is the boy to the left of the cat?". Flip the image: left becomes right, so the answer flips. Crop it: the answer stays. Swap the objects in the question: the answer flips again. These transformations have known effects, determined by geometry and language rather than scene content, and when models answer consistently under them, they tend to be correct. Crucially, because we know when answers should match or flip, we obtain a self-supervised reward signal without needing ground-truth labels.

We formalize this idea as *consistency verifiers*: reward functions that check whether two model answers satisfy the relationship induced by a transformation. Related ideas of equivariance appear in vision-language alignment (Wang et al., [2023](https://arxiv.org/html/2606.11918#bib.bib27)), and in cycle-consistency, which has been used as a reward signal in other domains (Bahng et al., [2025](https://arxiv.org/html/2606.11918#bib.bib2)). We optimize consistency verifiers using group relative policy optimization (GRPO) (Guo et al., [2025a](https://arxiv.org/html/2606.11918#bib.bib17)). Because our verifier scores *pairs* of completions rather than individual ones, we introduce OT-GRPO, a minimal-consistency matching variant that pairs completions to maximize inconsistency. This prevents the model from achieving high reward through lucky alignments and outperforms simpler pairing strategies.

On four spatial reasoning tasks (orientation, depth, size, and relative distance), training with consistency reward alone nearly matches training with ground-truth accuracy reward. This finding suggests that enforcing logical coherence unlocks latent spatial reasoning abilities, without access to labels (Yue et al., [2025](https://arxiv.org/html/2606.11918#bib.bib33); Chen et al., [2025b](https://arxiv.org/html/2606.11918#bib.bib10)). We also find that consistency transfers between tasks (e.g., training on orientation improves depth) and between data domains (indoor to outdoor), demonstrating that label-free consistency training generalizes almost as well as ground-truth supervision.

#### Contributions\.

- We observe that consistency under geometric and semantic transformations of the prompt is an indicator of correctness in spatial reasoning, and propose to use it as a self-supervised reward signal.
- We formalize *consistency verifiers* (reward functions checking answer consistency under such transformations) and introduce OT-GRPO, a minimal-matching scheme suited to optimize pairwise rewards with group-based proximal policy optimization methods.
- We show that consistency reward alone approaches accuracy reward on four spatial reasoning tasks (orientation, depth, size, relative distance), and outperforms seven self-supervised baselines including Visual Jigsaw and SSL4RL.
- We find that consistency training transfers across tasks (e.g., depth ↔ size) and domains (indoor ↔ outdoor) as effectively as accuracy training, and extends to numeric outputs (counting, absolute distance).

## 2 Related Work

#### Spatial Reasoning in VLMs\.

A growing body of benchmarks documents the spatial reasoning gap in vision-language models. Mind the Gap (Stogiannidis et al., [2025](https://arxiv.org/html/2606.11918#bib.bib25)) evaluates relative position and finds models lag humans by over 30%. SIBench (Yu et al., [2025](https://arxiv.org/html/2606.11918#bib.bib32)) tests perspective-taking and metric estimation with similar findings. EASI (Cai et al., [2025b](https://arxiv.org/html/2606.11918#bib.bib6)) provides a holistic evaluation across multiple spatial dimensions. Together, these benchmarks reveal that strong semantic understanding does not transfer to geometric understanding. To close this gap, prior work trains on synthetic scenes from 3D simulators (Chen et al., [2024b](https://arxiv.org/html/2606.11918#bib.bib8)) or annotates real images using depth estimators and 3D reconstructors (Chen et al., [2024a](https://arxiv.org/html/2606.11918#bib.bib7); Cai et al., [2025a](https://arxiv.org/html/2606.11918#bib.bib5); Cheng et al., [2024](https://arxiv.org/html/2606.11918#bib.bib11)). Both directions require ground-truth supervision. We follow a different path: rather than labeling answers, we exploit the fact that well-selected geometric or textual transformations induce *known* mappings on answers, providing a self-supervised reward signal.

#### Consistency as a Proxy for Correctness\.

The idea that consistency signals correctness has multiple instantiations. Wang et al. ([2022](https://arxiv.org/html/2606.11918#bib.bib29)) show that sampling multiple CoT paths and marginalizing over answers (self-consistency) improves reasoning accuracy. Zuo et al. ([2025](https://arxiv.org/html/2606.11918#bib.bib39)) use the majority answer as a pseudo-label for RL training. Zhang et al. ([2025](https://arxiv.org/html/2606.11918#bib.bib35)) extend this by using cross majority voting across problem reformulations as a self-supervised reward. Zhao et al. ([2025](https://arxiv.org/html/2606.11918#bib.bib36)) use the entropy of the model as a reward signal to favor consistency. Additionally, studies of prompt sensitivity document that models change answers under superficial phrasing changes, suggesting inconsistency as a failure mode (Zhang et al., [2024](https://arxiv.org/html/2606.11918#bib.bib34); Dagan et al., [2025](https://arxiv.org/html/2606.11918#bib.bib12)). Concurrent work applies RL to promote logical consistency in VLMs (Anonymous, [2025](https://arxiv.org/html/2606.11918#bib.bib1)). In this work we train with consistency reward *only*, without relying on any ground-truth labels.

#### Consistency in Computer Vision\.

Consistency objectives have a long history in vision. Cycle-consistency enables unpaired image translation (Zhu et al., [2017](https://arxiv.org/html/2606.11918#bib.bib38)) and correspondence learning (Zhou et al., [2016](https://arxiv.org/html/2606.11918#bib.bib37); Wang et al., [2019](https://arxiv.org/html/2606.11918#bib.bib28)). Recent work uses cycle consistency as a reward for vision-language alignment (Bahng et al., [2025](https://arxiv.org/html/2606.11918#bib.bib2)). Our setting differs: we verify a *known* relationship between VQA answers (invariance or equivariance) that is determined entirely by the transformation design, not learned from data.

#### RL for Reasoning\.

We build on GRPO (Guo et al., [2025a](https://arxiv.org/html/2606.11918#bib.bib17); DeepSeek-AI, [2025](https://arxiv.org/html/2606.11918#bib.bib14)), an RL algorithm that uses verifier-based rewards without a learned critic. The broader role of RL in reasoning remains under debate. Yue et al. ([2025](https://arxiv.org/html/2606.11918#bib.bib33)) argue that RL with verifiable rewards acts as a filter, improving sampling efficiency without expanding underlying capability. Chen et al. ([2025b](https://arxiv.org/html/2606.11918#bib.bib10)) suggest RL selects among pre-existing reasoning patterns rather than creating new ones. Our results align with this view: consistency reward steers the model toward answers that satisfy geometric laws, surfacing accuracy already latent in the pretrained model.

## 3 Consistency Verifiers

Standard RL with verifiable rewards (RLVR) requires ground-truth labels to evaluate rewards. We introduce *consistency verifiers*: reward functions that operate without labels by verifying the model's consistency under geometric and textual/semantic transformations between prompts.

#### Setup\.

Let $\mathcal{D}$ be a dataset of VQA prompts $x=(I,q)$ pairing an image $I$ with a question $q$. Given a prompt $x$, a model $\pi_{\theta}$ produces a completion $y\sim\pi_{\theta}(\cdot\mid x)$, which we interpret as the answer. Our goal is to train $\pi_{\theta}$ to produce correct answers without access to ground-truth labels.

#### Prompt Augmentation\.

Starting from a prompt $x=(I,q)$, we construct an *augmented* prompt $x'=(I',q')$ by applying a transformation $T$ drawn from a predefined set $\mathcal{T}$. Each transformation $T\in\mathcal{T}$ decomposes as $T=(T_{I},T_{q})$, where $T_{I}$ is an image transformation and $T_{q}$ is a text transformation, yielding $x'=T(x)=(T_{I}(I),T_{q}(q))$. Each transformation $T$ induces a *known* mapping $\phi_{T}$ on the answer space, which constrains how correct answers relate as a pair. Specifically, if $(y^{\star},y'^{\star})$ are the correct answers to $(x,x')$, then $y'^{\star}=\phi_{T}(y^{\star})$. The transformation thus determines the relationship within the pair without revealing either answer. Crucially, we know how the pair must relate without knowing the individual values, which enables supervision without ground truth. Depending on $T$, $\phi_{T}$ takes one of two forms:

- *Invariance* ($\phi_{T}=\mathrm{id}$): The transformation preserves the answer. For example, object-preserving crops do not change spatial relationships. A consistent pair must agree.
- *Equivariance* ($\phi_{T}=\neg$): The transformation changes the answer predictably. For example, a horizontal flip swaps left and right. Similarly, a text rewrite can swap relational words (e.g., left ↔ right) or object references (A ↔ B). A consistent pair must disagree accordingly.

Because $\phi_{T}$ is fully determined by the choice of $T$, we can check consistency *without* knowing the ground-truth labels.
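As a rough illustration of the two cases above, the sketch below represents a transformation as a pair of image and text functions plus a flag recording whether $\phi_{T}$ is the identity or a negation. The toy image type (a list of pixel rows), the `Transformation` class, and the `FLIP` and `CROP` names are assumptions made for this example and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Toy image: a list of pixel rows (an assumption made for illustration only).
Image = List[List[int]]


@dataclass(frozen=True)
class Transformation:
    """A prompt transformation T = (T_I, T_q) with its known answer mapping."""
    image_fn: Callable[[Image], Image]  # T_I
    text_fn: Callable[[str], str]       # T_q
    negates: bool                       # True: phi_T is negation, False: identity

    def apply(self, image: Image, question: str) -> Tuple[Image, str]:
        """Build the augmented prompt x' = (T_I(I), T_q(q))."""
        return self.image_fn(image), self.text_fn(question)

    def phi(self, answer: bool) -> bool:
        """Map an answer on x to the answer expected on x'."""
        return (not answer) if self.negates else answer


def hflip(image: Image) -> Image:
    """Mirror every row, which swaps left and right."""
    return [row[::-1] for row in image]


def center_crop(image: Image) -> Image:
    """Drop a one-pixel border; assumed here to keep both objects in view."""
    return [row[1:-1] for row in image[1:-1]]


def keep_text(question: str) -> str:
    """Leave the question unchanged."""
    return question


FLIP = Transformation(hflip, keep_text, negates=True)         # equivariance
CROP = Transformation(center_crop, keep_text, negates=False)  # invariance
```

Storing the answer mapping next to the transformation keeps the label-free property explicit: `phi` depends only on which transformation was applied, never on the scene.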

#### Consistency Verifier\.

The known mapping $\phi_{T}$ allows us to check whether completions from related prompts are *mutually consistent*. We define the *consistency verifier* as:

$$\text{verif}_{T}(y,y')=\begin{cases}1&\text{if }y'=\phi_{T}(y),\\ 0&\text{otherwise},\end{cases}$$

where $y\sim\pi_{\theta}(\cdot\mid x)$ and $y'\sim\pi_{\theta}(\cdot\mid x')$ are completions from the original and augmented prompts. The verifier returns $1$ if and only if the two answers satisfy the expected relationship. Because $\phi_{T}$ is determined entirely by the transformation design, the verifier can be evaluated without ground-truth labels: it provides a *self-supervised* reward signal based on cross-prompt consistency.
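For binary answers, the verifier defined above reduces to a few lines. The following is a minimal sketch; the function name and the boolean `negates` flag (standing in for the choice between $\phi_{T}=\mathrm{id}$ and $\phi_{T}=\neg$) are assumptions of this example.

```python
def consistency_verifier(y: bool, y_aug: bool, negates: bool) -> int:
    """Return 1 if y_aug equals phi_T(y), else 0.

    `negates` encodes the known mapping phi_T: True for equivariant
    transformations (negation), False for invariant ones (identity).
    No ground-truth label enters the computation.
    """
    expected = (not y) if negates else y
    return int(y_aug == expected)


# Equivariant transformation (e.g. horizontal flip): answers must disagree.
flip_consistent = consistency_verifier(True, False, negates=True)
flip_inconsistent = consistency_verifier(True, True, negates=True)

# Invariant transformation (e.g. object-preserving crop): answers must agree.
crop_consistent = consistency_verifier(False, False, negates=False)
```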

#### Illustrative Example\.

Consider a VQA prompt whose question is $q=$ "Is object A left of object B? True/False". We construct the augmented prompt $x'$ through the composition of three operations:

1. *Image*: Apply a horizontal flip, which swaps left ↔ right.
2. *Text*: Swap the object references (A ↔ B).
3. *Text*: Replace the relation (left → right).

Each operation negates the answer. Since we apply three negations, the net effect is also a negation: $\phi_{T}(y)=\neg y$. The key insight is that we can verify consistency *without knowing the ground truth*. If the model answers True on $x$ and False on $x'$, or vice versa, the answers are consistent, regardless of whether object A is actually left of object B. In other words, the verifier simply checks disagreement: $\text{verif}_{T}(y,y')=\mathbf{1}\{y\neq y'\}$ (see [Figure 1](https://arxiv.org/html/2606.11918#S1.F1)).
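The parity argument in this example can be sketched in code: each operation carries a flag saying whether it negates the answer, and the net mapping is the XOR of those flags. The string rewriting below is a toy stand-in for the text transformations, and the `__TMP__` placeholder and the `rewrite_question` helper are assumptions of this sketch.

```python
from functools import reduce
from operator import xor

# The three operations of the example, each paired with whether it negates the answer.
operations = [
    ("horizontal flip of the image", True),
    ("swap object references A <-> B", True),
    ("replace relation left -> right", True),
]

# An odd number of negations composes to a negation: phi_T(y) = not y.
net_negates = reduce(xor, (negates for _, negates in operations), False)


def rewrite_question(question: str) -> str:
    """Apply the two text operations: swap A and B, then replace left with right."""
    swapped = (
        question.replace("object A", "__TMP__")
        .replace("object B", "object A")
        .replace("__TMP__", "object B")
    )
    return swapped.replace("left", "right")


q = "Is object A left of object B? True/False"
q_aug = rewrite_question(q)
```

With an even number of negating operations the same computation would give the identity mapping, so the verifier would check agreement instead.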

## 4 Learning with Consistency Verifiers

Having defined the consistency verifier, we now turn to optimization. We build on GRPO (Guo et al., [2025a](https://arxiv.org/html/2606.11918#bib.bib17); DeepSeek-AI, [2025](https://arxiv.org/html/2606.11918#bib.bib14)). Standard GRPO assumes a scalar reward for each completion, but our consistency verifier scores *pairs*: one completion from $x$ and one from its augmentation $x'$. The main challenge is therefore to convert pairwise consistency scores into per-completion rewards. We recall standard GRPO, then introduce our adaptation.

### 4.1 Background on GRPO

In the standard supervised RLVR setting, we have access to ground-truth answers $y^{\star}$ and seek to maximize expected accuracy:

$$\max_{\theta}\;\mathbb{E}_{x\sim\mathcal{D}}\,\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\big[\text{verif}(y,y^{\star})\big],\tag{1}$$

where $\text{verif}(y,y^{\star})=\mathbf{1}\{y=y^{\star}\}$ is the accuracy verifier. GRPO optimizes this objective by sampling $K$ completions from $\pi_{\theta_{\mathrm{old}}}$ (a snapshot of the current policy):

$$y_{1:K}\coloneqq(y_{1},\ldots,y_{K})\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x).$$

Each completion receives a reward $r_{i}=\text{verif}(y_{i},y^{\star})$, which GRPO normalizes within the group to form advantages:

$$A_{i}\coloneqq\frac{r_{i}-\mathrm{mean}(r_{1:K})}{\mathrm{std}(r_{1:K})}.\tag{2}$$
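The group normalization of Eq. (2) can be sketched as follows. Two details are assumptions of this example rather than statements from the paper: the population standard deviation is used, and a small constant `eps` is added to the denominator so that a group with identical rewards yields zero advantages instead of a division by zero.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of K completions (Eq. 2).

    `eps` and the population standard deviation are assumptions of this sketch.
    """
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two correct and two incorrect completions in a group of K = 4.
advantages = group_advantages([1, 0, 1, 0])
```

When all completions in a group receive the same reward, every advantage is zero and the group contributes no gradient signal.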
Each completion $y_{i}$ is a token sequence $(y_{i,1},\ldots,y_{i,T_{i}})$. GRPO optimizes a PPO-style clipped surrogate at the token level. The importance ratio and clipping operator are:

$$\rho_{i,t}(\theta)\coloneqq\frac{\pi_{\theta}(y_{i,t}\mid x,y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid x,y_{i,<t})},\tag{3}$$

$$\mathrm{clip}_{\varepsilon}(u)\coloneqq\min\big(\max(u,1-\varepsilon),\,1+\varepsilon\big).\tag{4}$$

The token-level surrogate and the full GRPO objective are:

$$\ell_{i,t}(\theta;A_{i})\coloneqq A_{i}\cdot\min\big(\rho_{i,t}(\theta),\,\mathrm{clip}_{\varepsilon}(\rho_{i,t}(\theta))\big),\tag{5}$$

$$J_{\mathrm{GRPO}}(\theta)\coloneqq\frac{1}{K}\sum_{i=1}^{K}\frac{1}{T_{i}}\sum_{t=1}^{T_{i}}\ell_{i,t}(\theta;A_{i}).\tag{6}$$
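To make Eqs. (3) to (6) concrete, the sketch below transcribes them as written, operating on per-token log-probabilities supplied as plain lists. The function names, the log-probability inputs, and the default `eps=0.2` are assumptions of this example; a real implementation would compute these quantities on tensors with automatic differentiation.

```python
import math
from typing import Sequence


def clip(u: float, eps: float) -> float:
    """Clipping operator of Eq. (4)."""
    return min(max(u, 1.0 - eps), 1.0 + eps)


def token_surrogate(logp_new: float, logp_old: float, advantage: float,
                    eps: float = 0.2) -> float:
    """Token-level surrogate of Eq. (5), using the importance ratio of Eq. (3)."""
    rho = math.exp(logp_new - logp_old)
    return advantage * min(rho, clip(rho, eps))


def grpo_objective(logps_new: Sequence[Sequence[float]],
                   logps_old: Sequence[Sequence[float]],
                   advantages: Sequence[float],
                   eps: float = 0.2) -> float:
    """GRPO objective of Eq. (6): mean over completions of the per-token mean."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logps_new, logps_old, advantages):
        tokens = [token_surrogate(n, o, adv, eps) for n, o in zip(lp_new, lp_old)]
        total += sum(tokens) / len(tokens)
    return total / len(advantages)
```

When the new and old policies coincide, every ratio equals 1 and the objective reduces to the mean advantage of the group.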

### 4.2 From Ground-Truth to Pairwise Consistency

With ground-truth labels, each completion $y_{i}$ has a canonical "partner", the label $y^{\star}$, against which it is evaluated. In our setting, we do not have access to $y^{\star}$. Instead, we have paired prompts $(x,x')$ where $x'=T(x)$ for some transformation $T$, and the consistency verifier $\text{verif}_{T}(y,y')$ that checks whether completions $y\sim\pi_{\theta}(\cdot\mid x)$, $y'\sim\pi_{\theta}(\cdot\mid x')$ satisfy the expected relationship induced by the transformation. Our training objective becomes:

$$\max_{\theta}\;\mathbb{E}_{\begin{subarray}{c}x\sim\mathcal{D},\,T\sim\mathcal{T}\\ x'=T(x)\end{subarray}}\,\mathbb{E}_{\begin{subarray}{c}y\sim\pi_{\theta}(\cdot\mid x)\\ y'\sim\pi_{\theta}(\cdot\mid x')\end{subarray}}\big[\text{verif}_{T}(y,y')\big].\tag{7}$$

To use GRPO, we sample $K$ completions from each prompt, yielding a $K\times K$ matrix of verifier scores $\text{verif}_{T}(y_{i},y'_{j})$, $1\leq i,j\leq K$. The challenge is to reduce this matrix to per-completion rewards $r_{i}$ for the GRPO update.

#### Natural pairing strategies\.

Given the verifier matrix, two natural strategies aggregate it into per\-completion rewards:

- *Random pairing* pairs completions arbitrarily, e.g., by generation order: $r_{i}^{\mathrm{rand}}=\text{verif}_{T}(y_{i},y'_{i})$.
- *One-to-all* averages over the other group: $r_{i}^{\mathrm{all}}=\frac{1}{K}\sum_{j=1}^{K}\text{verif}_{T}(y_{i},y'_{j})$.

Both yield the same expected reward ($\mathbb{E}[r_{i}^{\mathrm{rand}}]=\mathbb{E}[r_{i}^{\mathrm{all}}]$), but one-to-all reduces variance through averaging.
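The two aggregation strategies above can be sketched on a small verifier matrix. The example assumes binary answers and a transformation with a known negation or identity mapping; the function names and the sample answers are illustrative only.

```python
from typing import List, Sequence


def verifier_matrix(ys: Sequence[bool], ys_aug: Sequence[bool],
                    negates: bool) -> List[List[int]]:
    """K x K matrix with V[i][j] = verif_T(y_i, y'_j) for binary answers."""
    return [[int(y_aug == ((not y) if negates else y)) for y_aug in ys_aug]
            for y in ys]


def random_pairing_rewards(V: Sequence[Sequence[int]]) -> List[int]:
    """Pair completions by generation order: r_i = V[i][i]."""
    return [V[i][i] for i in range(len(V))]


def one_to_all_rewards(V: Sequence[Sequence[int]]) -> List[float]:
    """Average each row over all completions of the augmented prompt."""
    return [sum(row) / len(row) for row in V]


# K = 4 answers on the original prompt and on a flipped (negating) prompt.
ys = [True, True, False, True]
ys_aug = [False, True, False, False]
V = verifier_matrix(ys, ys_aug, negates=True)
r_rand = random_pairing_rewards(V)
r_all = one_to_all_rewards(V)
```

In this toy group both strategies have the same average reward (0.5 and 0.625 would differ only through which single partner the diagonal happens to pick), which illustrates why random pairing is the noisier estimate of the two.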

### 4.3 Adversarial Consistency Pairing

Let $\pi_{\theta}^{x}\coloneqq\pi_{\theta}(\cdot\mid x)$ and $\pi_{\theta}^{x'}\coloneqq\pi_{\theta}(\cdot\mid x')$ denote the output distributions on the original and augmented prompts. The two strategies above correspond to sampling $y$ and $y'$ independently, namely $(y,y')\sim\pi_{\theta}^{x}\otimes\pi_{\theta}^{x'}$. More generally, we can sample $(y,y')$ from any *coupling* $\gamma\in\Gamma(\pi_{\theta}^{x},\pi_{\theta}^{x'})$, i.e., any joint distribution with $\pi_{\theta}^{x}$ and $\pi_{\theta}^{x'}$ as marginals. This yields a generalized objective:

$$\max_{\theta}\;\mathbb{E}_{\begin{subarray}{c}x\sim\mathcal{D},\,T\sim\mathcal{T}\\ x'=T(x)\end{subarray}}\;\mathbb{E}_{(y,y')\sim\gamma}\big[\text{verif}_{T}(y,y')\big].\tag{8}$$

#### Adversarial pairing\.

Now, the question is: which coupling should we choose? We propose to be adversarial: select the coupling that *minimizes* the expected consistency. This corresponds to finding an optimal transport (OT) coupling (Peyré & Cuturi, [2019](https://arxiv.org/html/2606.11918#bib.bib22)), transforming the objective into a max-min problem:

$$\max_{\theta}\;\mathbb{E}_{\begin{subarray}{c}x\sim\mathcal{D},\,T\sim\mathcal{T}\\ x'=T(x)\end{subarray}}\;\min_{\gamma\in\Gamma(\pi_{\theta}^{x},\pi_{\theta}^{x'})}\mathbb{E}_{(y,y')\sim\gamma}\big[\text{verif}_{T}(y,y')\big].\tag{9}$$

In practice, we approximate the OT coupling from the $K$ completions sampled from each distribution $\pi_{\theta}^{x}$ and $\pi_{\theta}^{x'}$. This yields a third strategy to aggregate the verifier matrix into per-completion rewards. By the Birkhoff–von Neumann theorem (Birkhoff, [1946](https://arxiv.org/html/2606.11918#bib.bib3)), the empirical OT coupling is realized by a permutation: the one minimizing total consistency. We then define the reward as:

$$\sigma^{\star}\in\operatorname*{arg\,min}_{\sigma\in\mathcal{S}_{K}}\sum_{i=1}^{K}\text{verif}_{T}(y_{i},y'_{\sigma(i)}),\qquad r_{i}^{\mathrm{OT}}=\text{verif}_{T}(y_{i},y'_{\sigma^{\star}(i)}).$$

This is a linear assignment problem, solvable in $O(K^{3})$ time, which is negligible for $K\leq 16$, the range we use in practice. In [Section A.2](https://arxiv.org/html/2606.11918#A1.SS2), we show that ([9](https://arxiv.org/html/2606.11918#S4.E9)) is equivalent to minimizing the Wasserstein distance (Santambrogio, [2015](https://arxiv.org/html/2606.11918#bib.bib23)) between $\pi_{\theta}^{x}$ and $\pi_{\theta}^{x'}$, with cost $c_{T}=-\text{verif}_{T}$, namely the inconsistency.
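The minimal matching and the resulting rewards can be sketched as below. To stay dependency-free, this example enumerates all permutations, which costs $O(K!)$ and is only suitable for very small $K$; it is a stand-in for the $O(K^{3})$ assignment solver mentioned above (for instance SciPy's `scipy.optimize.linear_sum_assignment`), not the authors' implementation. The function names are assumptions of this sketch.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def minimal_matching(V: Sequence[Sequence[int]]) -> List[int]:
    """Permutation sigma minimizing sum_i V[i][sigma(i)], by brute force."""
    K = len(V)
    best = min(permutations(range(K)),
               key=lambda sigma: sum(V[i][sigma[i]] for i in range(K)))
    return list(best)


def ot_rewards(V: Sequence[Sequence[int]]) -> Tuple[List[int], List[int], List[int]]:
    """Return (sigma, r, r_aug): matched partners share the same reward."""
    sigma = minimal_matching(V)
    r = [V[i][sigma[i]] for i in range(len(V))]
    r_aug = [0] * len(V)
    for i, j in enumerate(sigma):
        r_aug[j] = r[i]  # completion y'_j inherits the reward of its partner y_i
    return sigma, r, r_aug


# Verifier matrix for ys = [T, T, F, T] and ys_aug = [F, T, F, F] under negation.
V_example = [[1, 0, 1, 1], [1, 0, 1, 1], [0, 1, 0, 0], [1, 0, 1, 1]]
sigma_star, r_ot, r_ot_aug = ot_rewards(V_example)
```

On this matrix the minimal matching leaves only two consistent pairs out of four, so the OT rewards sum to 2.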

Algorithm 1: One OT-GRPO Iteration (batch size 1).

1. *Require*: original prompt $x\sim\mathcal{D}$.
2. *Transform*: sample $T\sim\mathcal{T}$, and set $x'\leftarrow T(x)$.
3. *Generate*: $y_{1:K}\sim\pi_{\theta_{\text{old}}}(\cdot\mid x)$ and $y'_{1:K}\sim\pi_{\theta_{\text{old}}}(\cdot\mid x')$.
4. *Score* (each pair): $V_{ij}\leftarrow\text{verif}_{T}(y_{i},y'_{j})$.
5. *Minimal Match*: $\sigma^{\star}\leftarrow\operatorname*{arg\,min}_{\sigma\in\mathcal{S}_{K}}\sum_{i}V_{i,\sigma(i)}$.
6. *Rewards*: $r_{i}\leftarrow V_{i,\sigma^{\star}(i)}$, and $r'_{\sigma^{\star}(i)}\leftarrow r_{i}$.
7. *GRPO Updates* on $(y_{1:K},r_{1:K})$ and $(y'_{1:K},r'_{1:K})$.

#### OT\-GRPO loss\.

We derive a GRPO surrogate for the max-min objective ([9](https://arxiv.org/html/2606.11918#S4.E9)). Each matched pair $(y_{i},y'_{\sigma^{\star}(i)})$ receives the same reward $r_{i}^{\text{OT}}=\text{verif}_{T}(y_{i},y'_{\sigma^{\star}(i)})$, normalized into a shared advantage $A_{i}$. We apply the standard GRPO update to both completions:

$$\begin{split}J_{\mathrm{OT\text{-}GRPO}}(\theta)&=\underbrace{\frac{1}{K}\sum_{i=1}^{K}\frac{1}{T_{i}}\sum_{t=1}^{T_{i}}\ell_{i,t}(\theta;A_{i})}_{\text{completions to original prompt }y_{i}}\\ &\quad+\underbrace{\frac{1}{K}\sum_{i=1}^{K}\frac{1}{T'_{\sigma^{\star}(i)}}\sum_{t=1}^{T'_{\sigma^{\star}(i)}}\ell'_{\sigma^{\star}(i),t}(\theta;A_{i})}_{\text{completions to augmented prompt }y'_{\sigma^{\star}(i)}}.\end{split}\tag{10}$$

We treat $\sigma^{\star}$ as constant when computing gradients. This is justified by the theorem of Danskin ([1966](https://arxiv.org/html/2606.11918#bib.bib13)), since $\sigma^{\star}$ is the minimizer of the inner problem. We stress that both original and augmented prompts contribute to the training loss: answers to augmented prompts do not merely provide pseudo-labels. Alg. [1](https://arxiv.org/html/2606.11918#alg1) summarizes one OT-GRPO iteration.

#### Why Minimal Consistency?

Minimal consistency finds the *most challenging* pairing. If a model produces inconsistent completions, minimal consistency will find and expose those inconsistencies, whereas random pairing might miss them by luck. Conversely, if a model is truly consistent, minimal consistency cannot expose inconsistency because none exists. This intuition can be made precise. Under a random baseline where the model guesses uniformly at random, random pairing and one-to-all both yield expected reward $\mathbb{E}[r_{i}^{\text{rand}}]=\mathbb{E}[r_{i}^{\text{all}}]=1/2$. Minimal consistency, however, yields $\mathbb{E}[r_{i}^{\mathrm{OT}}]\approx 1/\sqrt{\pi K}$, which vanishes as $K$ grows (see [Appendix A](https://arxiv.org/html/2606.11918#A1) for the derivation). In our label-free setting, this makes minimal consistency robust to reward hacking: high rewards require genuine consistency everywhere.
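The random-baseline claim can be checked numerically with a small Monte Carlo sketch. It assumes binary answers and a negating transformation, in which case the minimal total consistency has a closed form: if $a$ and $b$ count the True answers on the original and augmented prompts, at most $K-|a-b|$ pairs can be matched as equal, so the minimal matching leaves exactly $|a-b|$ consistent pairs. The function name, trial count, and seed are assumptions of this example.

```python
import math
import random
from typing import Tuple


def simulate(K: int, trials: int = 20000, seed: int = 0) -> Tuple[float, float]:
    """Estimate mean per-completion reward for a model guessing uniformly at random.

    Returns (random-pairing estimate, minimal-matching estimate) for binary
    answers under a negating transformation (consistent iff answers differ).
    """
    rng = random.Random(seed)
    rand_total = 0.0
    ot_total = 0.0
    for _ in range(trials):
        ys = [rng.random() < 0.5 for _ in range(K)]
        ys_aug = [rng.random() < 0.5 for _ in range(K)]
        # Random pairing: pair by generation order.
        rand_total += sum(y != y_aug for y, y_aug in zip(ys, ys_aug)) / K
        # Minimal matching: |a - b| consistent pairs remain (closed form).
        ot_total += abs(sum(ys) - sum(ys_aug)) / K
    return rand_total / trials, ot_total / trials


K = 16
rand_mean, ot_mean = simulate(K)
predicted = 1.0 / math.sqrt(math.pi * K)
```

Comparing `ot_mean` with `predicted` for several values of `K` shows how the reward available to a guessing model shrinks as the group grows, while `rand_mean` stays near one half.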

### 4.4 Reasoning format

We promote CoT reasoning (Wei et al., [2023](https://arxiv.org/html/2606.11918#bib.bib30)) by prompting the model with a system instruction adapted from DeepSeek-R1 (DeepSeek-AI, [2025](https://arxiv.org/html/2606.11918#bib.bib14)). Completions are expected to follow a structured format: reasoning enclosed in `<think>...</think>` tags, followed by an answer in `<answer>...</answer>` tags. In addition to the consistency reward, we apply a format reward that checks whether the completion adheres to this structure. We use a weight of 1.0 for both consistency and format rewards.
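
A format check of this kind could be sketched as follows. This is an illustrative assumption rather than the paper's implementation: the regular expression, the function names, and the choice to combine the two rewards as a weighted sum are all hypothetical, and the pattern only verifies that a `<think>` block is followed by an `<answer>` block spanning the whole completion.

```python
import re

# Hypothetical pattern: optional whitespace, a <think> block, then an <answer> block.
_FORMAT_RE = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def format_reward(completion: str) -> float:
    """Return 1.0 if the entire completion matches the think/answer layout, else 0.0."""
    return 1.0 if _FORMAT_RE.fullmatch(completion) else 0.0


def extract_answer(completion: str):
    """Return the stripped text inside <answer>...</answer>, or None if malformed."""
    match = _FORMAT_RE.fullmatch(completion)
    return match.group(2).strip() if match else None


def total_reward(consistency: float, completion: str,
                 w_consistency: float = 1.0, w_format: float = 1.0) -> float:
    """Assumed combination: weighted sum of consistency and format rewards."""
    return w_consistency * consistency + w_format * format_reward(completion)
```

For example, `format_reward("<think>The chair is nearer.</think><answer>True</answer>")` passes the check, while a bare `"True"` does not.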

Table 1: Total number of examples per task and domain.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x2.png)

Figure 2: Same-task evaluation accuracy on SUN RGB-D (indoor) data for two model sizes (3B and 7B) and four tasks (Depth, Orientation, Size and Relative Distance). We separately train models using either an accuracy verifier (requires ground-truth labels) or a consistency verifier (no labels needed), then evaluate on held-out test samples. Despite never seeing any ground-truth labels during training, consistency-trained models nearly match accuracy-trained ones, with average gaps of only 2.3pp (3B) and 2.7pp (7B).

## 5 Experiments

### 5.1 Experimental Setup

#### Tasks\.

We evaluate on four binary VQA tasks covering complementary spatial dimensions: *orientation* (left/right), *depth* (closer/further to camera), *size* (3D volume/metric comparison), and *relative distance* (which object is closer to an anchor). Each task highlights objects in the image and asks a True/False question. Ground truth comes from 3D bounding box annotations; see [Appendix D](https://arxiv.org/html/2606.11918#A4) for details.

#### Data\.

We use images from KITTI (outdoor driving) and SUN RGB-D (indoor scenes) with Omni3D 3D annotations (Brazil et al., [2023](https://arxiv.org/html/2606.11918#bib.bib4)). To ensure unambiguous ground truth, we filter individual objects by visibility, size, and camera-relative position, and pairs by bounding-box overlap and a minimum metric gap along the task-relevant dimension; remaining pairs are scored by separation and sampled to balance diversity with unambiguity. Each task uses five paraphrased question templates sampled at training time. Set sizes are provided in Table [1](https://arxiv.org/html/2606.11918#S4.T1); see [Appendices D](https://arxiv.org/html/2606.11918#A4) and [D.5](https://arxiv.org/html/2606.11918#A4.SS5) for full details and example prompts. All models train on SUN RGB-D; KITTI is held out to test domain generalization.

#### Transformations\.

We construct augmented prompts by applying image and text transformations, each sampled independently with probability 0.5. Image transforms include horizontal flip, object-preserving crop, and color jitter. Text transforms include swapping object order and swapping the queried relation (e.g., left $\leftrightarrow$ right). Some transforms are *invariant* (answer unchanged); others are *equivariant* (answer flips). The consistency verifier uses the known effect of each transform to determine whether answers should match.
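
How a verifier might use the known effect of each transform can be sketched as below. The effect table is a hypothetical example for a left/right orientation question (which transforms negate the answer depends on the task and is not listed in this section), and composing several answer-flipping transforms by parity is an assumption that holds for binary answers.

```python
import random

# Hypothetical effect table for a left/right orientation question (an assumption
# for illustration): True means the transform negates the correct answer.
NEGATES_ANSWER = {
    "horizontal_flip": True,
    "object_preserving_crop": False,
    "color_jitter": False,
    "swap_object_order": True,
    "swap_relation": True,
}


def sample_transforms(rng, p=0.5):
    """Include each transform independently with probability p."""
    return [name for name in NEGATES_ANSWER if rng.random() < p]


def answers_should_differ(applied):
    """Negations compose by parity: an odd number of flipping transforms flips the answer."""
    return sum(NEGATES_ANSWER[name] for name in applied) % 2 == 1


def consistency_verifier(y, y_aug, applied):
    """Return 1 if the two binary answers satisfy the expected relation, else 0."""
    return int((y != y_aug) == answers_should_differ(applied))
```

Under this table, a horizontal flip alone expects opposite answers, while a horizontal flip combined with a relation swap expects matching answers, since the two negations cancel.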

#### Models.

We fine-tune Qwen2.5-VL 3B and 7B with LoRA (Hu et al., [2021](https://arxiv.org/html/2606.11918#bib.bib19)) with $r=32$ and $\alpha=64$ applied to attention and MLP layers. The vision encoder is frozen.

#### Training\.

Each task/source uses an 80/20 split with a fixed seed shared across experiments. Because dataset sizes vary (Table [1](https://arxiv.org/html/2606.11918#S4.T1)), we fix training steps rather than epochs: $T=500$ steps with $\eta=10^{-6}$, group size $K=8$, $\tau=1.0$, and no KL penalty ($\beta=0$). Each step samples 4 prompt pairs per device across 8 H100 GPUs (256K completions total). For minimal pairing we use the network-simplex solver from POT (Flamary et al., [2021](https://arxiv.org/html/2606.11918#bib.bib15)).
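
As a sanity check on the stated total, one reading of the batch configuration reproduces the 256K figure. It assumes (this is not stated explicitly in the text) that each prompt pair yields $K$ completions for the original prompt and $K$ for the augmented one, and that the total is counted over all $T$ training steps:

```latex
\underbrace{4}_{\text{pairs/device}} \times
\underbrace{8}_{\text{GPUs}} \times
\underbrace{2}_{\text{prompts/pair}} \times
\underbrace{8}_{K} \times
\underbrace{500}_{T}
= 256{,}000 \text{ completions}
```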

#### Evaluation\.

We evaluate on the held-out 20% test split of every (task, source) combination and use two training regimes. In the *per-task* regime ([Figures 2](https://arxiv.org/html/2606.11918#S4.F2), [3](https://arxiv.org/html/2606.11918#S5.F3), [4](https://arxiv.org/html/2606.11918#S5.F4), [7](https://arxiv.org/html/2606.11918#S5.F7) and [8](https://arxiv.org/html/2606.11918#S5.F8)), each model is trained on a single task’s SUN RGB-D training set and evaluated on the test splits of all (task, source) pairs, yielding the same-task, cross-task, and cross-domain measurements reported below. In the *joint* regime ([Figures 5](https://arxiv.org/html/2606.11918#S5.F5) and [6](https://arxiv.org/html/2606.11918#S5.F6)), our model is trained on the union of the four boolean SUN RGB-D training sets; we then report the average test accuracy across the four SUN RGB-D tasks, with KITTI held out. For every self-supervised baseline appearing in [Figure 5](https://arxiv.org/html/2606.11918#S5.F5) (Visual Jigsaw and SSL4RL variants), we use the publicly released Hugging Face checkpoint without further fine-tuning on our data.

#### Accuracy vs\. Consistency Training\.

We compare two reward signals: *accuracy training* rewards correct answers using ground-truth labels (1/0), while *consistency training* hides labels entirely and rewards predictions that satisfy the expected consistency relationship under transformations. For fairness, both methods share the same data, the same online-sampled image and text augmentations, and the same held-out test set.

### 5.2 In-Depth Comparison: Accuracy vs. Consistency

#### Consistency approaches ground\-truth accuracy\.

Figure [2](https://arxiv.org/html/2606.11918#S4.F2) compares baseline, accuracy training, and consistency training on SUN RGB-D. Consistency closely tracks accuracy: 76.8% vs. 79.6% for 7B (gap 2.8pp) and 74.1% vs. 76.4% for 3B (gap 2.3pp), both well above baseline (63.4% / 58.2%). The gap is stable across tasks (2.1–3.4pp for 7B), showing that the self-supervised consistency signal recovers most of the gain achievable from labels.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x3.png)

Figure 3: Cross-task transfer on SUN RGB-D (7B model). Each cell shows the improvement over the pre-training baseline (in percentage points) when training on the row task and evaluating on the column task. Off-diagonal cells (colored) show cross-task transfer; diagonal cells (gray) show same-task performance for reference. The rightmost panel shows the average off-diagonal improvement: consistency training nearly matches accuracy training with a gap of only 0.8pp.

#### Consistency transfers across tasks.

Figure [3](https://arxiv.org/html/2606.11918#S5.F3) reports cross-task transfer (7B, SUN RGB-D), with cells showing improvement over baseline by training/eval task. All 12 off-diagonal entries are positive for both methods (worst: orientation $\to$ size, +5.8pp), giving average gains of +10.1pp (accuracy) vs. +9.2pp (consistency); same-task (diagonal) gains are +14.3pp vs. +13.5pp, a sub-1pp gap throughout. Transfer is asymmetric: depth yields the largest average cross-task gain (+12.3 / +11.1pp) while orientation gives the least (+7.8 / +7.2pp). This is consistent with depth being the only task whose answer depends on a continuous 3D axis that also underlies size and inter-object distance comparisons, while orientation is essentially a 2D judgment whose representations transfer least to the other three tasks.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x4.png)

Figure 4: Cross-domain transfer from SUN RGB-D to KITTI (7B model). Models are trained on indoor scenes (SUN RGB-D) and evaluated on outdoor driving scenes (KITTI). Diagonal cells (gray) show same-task cross-domain transfer; off-diagonal cells (colored) combine both task and domain shifts. Despite the visual gap between indoor and outdoor environments, both training methods generalize well. Consistency training nearly matches accuracy training, with off-diagonal gaps of only 0.5pp.

#### Consistency transfers across tasks and domains.

Figure [4](https://arxiv.org/html/2606.11918#S5.F4) shows transfer from SUN RGB-D (indoor) to KITTI (outdoor); diagonals capture same-task domain shift, off-diagonals combine task and domain shift. Same-task gains over the KITTI baseline are +15.6pp (accuracy) vs. +14.7pp (consistency); cross-task gains are +10.9 vs. +10.5pp, again a sub-1pp gap. Despite a higher KITTI baseline (68.2% vs. 63.4% for 7B), same-task cross-domain transfer (+15pp) exceeds same-task within-domain gains on SUN RGB-D (+13.5pp), suggesting spatial reasoning learned on cluttered indoor scenes generalizes well to sparser outdoor environments and that task alignment matters more than domain similarity.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x5.png)

Figure 5: Comparison against self-supervised baselines on SUN RGB-D (val), averaged over the four boolean tasks. Our method (acc. and cons.) uses Qwen2.5-VL-7B trained jointly on the four boolean tasks (2,000 steps); Visual Jigsaw and SSL4RL variants are evaluated from their publicly released Hugging Face checkpoints, without further fine-tuning on our data. Our consistency reward (no labels) outperforms every self-supervised baseline and trails accuracy training (with labels) by only 2.7pp.

### 5.3 Comparison to Self-Supervised Baselines

#### Consistency outperforms self\-supervised baselines\.

Across both families of self-supervised baselines, label-free consistency wins outright (Figure [5](https://arxiv.org/html/2606.11918#S5.F5)). To benchmark against published methods that share a task pool, we train jointly on the four boolean tasks (Qwen2.5-VL-7B, 2,000 steps on $\sim$40K shuffled SUN RGB-D examples, all other hyperparameters as above) and compare against the Visual Jigsaw and SSL4RL checkpoints released by the authors on Hugging Face. At 83.8%, our reward beats the strongest Visual Jigsaw variant (Wu et al., [2025](https://arxiv.org/html/2606.11918#bib.bib31)) (Image, Video, 3D) by +3.6pp and four SSL4RL variants (Guo et al., [2025b](https://arxiv.org/html/2606.11918#bib.bib18)) (Rotation, Position, Contrastive, Jigsaw) by +13–24pp; accuracy training (with labels) tops the table at 86.5%, leaving only a 2.7pp gap to label-free consistency. Rotation/jigsaw-style pretext tasks transfer the worst: several land below the 60.6% pre-training baseline.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x6.png)

Figure 6: Robustness to label corruption. “Acc. +$N$%” flips $N$% of training labels uniformly at random in accuracy training; consistency training uses no labels at all (orange dashed line). Accuracy still edges out consistency at 10% corruption, but the label-free signal overtakes it from 20% onward.

### 5.4 Robustness to Label Corruption

#### Consistency overtakes accuracy from 20% noise\.

Reusing the all-tasks protocol above, we corrupt accuracy training by flipping each ground-truth label independently with probability $p\in\{0.1,0.2,0.3,0.4\}$ before computing the reward; consistency, reading no labels, is unaffected. Clean accuracy reaches 86.5% and consistency 83.8%. At 10% corruption accuracy drops to 84.1%, essentially matching the label-free signal (within 0.3pp); from 20% onward consistency overtakes accuracy outright, with gaps of 1.2pp (Acc. 82.6%), 4.2pp (79.6%), and 7.4pp (76.4%) at 20%, 30%, and 40% noise, respectively (Figure [6](https://arxiv.org/html/2606.11918#S5.F6)). Our uniform per-example flips are a conservative noise model: real annotation pipelines tend to produce errors that are *correlated* and concentrated on the harder examples, so the crossover where consistency overtakes accuracy would likely arrive below the 20% reported here. This matters in practice: spatial-reasoning annotation pipelines chain depth estimators, calibrators, detectors, and frontier LLMs (Chen et al., [2024a](https://arxiv.org/html/2606.11918#bib.bib7); Cheng et al., [2024](https://arxiv.org/html/2606.11918#bib.bib11)), so even small per-stage error rates compound into corruption levels where consistency overtakes supervision.
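
The corruption protocol described above can be sketched in a few lines. This is an illustrative sketch with hypothetical function names, assuming binary labels and a seeded generator for reproducibility; it is not the authors' pipeline.

```python
import random


def corrupt_labels(labels, p, seed=0):
    """Flip each binary label independently with probability p."""
    rng = random.Random(seed)
    return [(not y) if rng.random() < p else y for y in labels]


def accuracy_reward(prediction, label):
    """1/0 reward against whatever label is supplied (clean or corrupted)."""
    return 1.0 if prediction == label else 0.0


if __name__ == "__main__":
    clean = [i % 2 == 0 for i in range(10_000)]
    noisy = corrupt_labels(clean, p=0.2)
    # A perfectly correct policy is rewarded only where the label was not flipped.
    mean_reward = sum(accuracy_reward(y, z) for y, z in zip(clean, noisy)) / len(clean)
    print(f"mean reward of a perfect policy under 20% flips: {mean_reward:.3f}")
```

The point of the example is that a fully correct policy receives reward on only about a $1-p$ fraction of examples once labels are flipped, whereas a reward that never reads labels is unchanged.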

![Refer to caption](https://arxiv.org/html/2606.11918v1/x7.png)

Figure 7: Extension to numeric tasks. Numeric accuracy verifier $\text{verif}(y,y^{\star})=\max(0,1-|y-y^{\star}|/y^{\star})$, reported as a percentage, on counting (integer object counts) and absolute distance estimation (meters). Consistency closely tracks accuracy on both tasks, trailing by 0.5pp on counting and 2.3pp on absolute distance.

### 5.5 Exploration on Numeric Tasks

#### Consistency closely matches accuracy on numeric outputs\.

We extend the framework to two numeric tasks on SUN RGB-D: *counting* (integer in $\{2,\ldots,5\}$, e.g., *How many chairs are in the image?*) and *absolute distance estimation* (continuous meters, e.g., *What is the distance between object 1 and object 2 in meters?*; examples in [Section D.7](https://arxiv.org/html/2606.11918#A4.SS7)). We reuse the same image and text augmentations introduced for the boolean tasks but drop relation swap, since neither a count nor a metric distance admits an equivariant relation to negate. All retained transformations are therefore *invariant*: jitter, mirroring, cropping around the queried objects, and paraphrasing the question leave the underlying count or 3D distance unchanged. In this regime the consistency verifier collapses to a single invariance check: the model should produce the same answer on both slots of a pair.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x8.png)

Figure 8: Ablation on pairing strategy for consistency training (7B model, SUN RGB-D). Each point compares minimal pairing (y-axis) against an alternative strategy (x-axis): random (purple circles) or one-to-all (cyan triangles). Points above the diagonal indicate minimal pairing outperforms the alternative. Left: self-task accuracy; each point is one of the four tasks, showing that minimal pairing achieves higher accuracy. Right: cross-task transfer improvement over baseline ($\Delta$ in percentage points); each point is one of the 12 off-diagonal train/eval task pairs, showing that minimal pairing yields larger transfer gains. The green region highlights where minimal pairing wins.

Because the prediction is now a continuous quantity rather than a True/False label, both the accuracy and the consistency verifiers from [Section 3](https://arxiv.org/html/2606.11918#S3) have to be adapted: exact equality is too strict, so we replace it with a *rescaled mean absolute error*. Each verifier returns $1$ when its two arguments coincide, decreases linearly in their absolute difference, and clips to $0$ once that difference exceeds the reference value. The accuracy verifier rescales the error by the ground truth $y^{\star}$, while the (symmetric) consistency verifier rescales by the larger of the two predictions, keeping both scores in $[0,1]$:

$$
\begin{aligned}
\text{verif}(y,y^{\star}) &= \max\big(0,\,1-|y-y^{\star}|/y^{\star}\big),\\
\text{verif}_{T}(y,y') &= \max\big(0,\,1-|y-y'|/\max(y,y')\big).
\end{aligned}
$$

We train Qwen2.5-VL-7B per task for 500 steps with all other hyperparameters identical to the boolean setting. Consistency closely tracks accuracy on both tasks (Figure [7](https://arxiv.org/html/2606.11918#S5.F7)): on counting, accuracy reaches 68.8% and consistency 68.3%, a 0.5pp gap from a base near chance (0.1%); on absolute distance, accuracy hits 41.4% versus 39.1% for consistency, a 2.3pp gap. In both cases the label-free signal trails the labeled one by less than the boolean-task gap (2.7pp), suggesting the consistency reward remains competitive as we extend beyond binary answer spaces.
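
The two numeric verifiers translate directly into code. The sketch below follows the formulas above; the function names and the handling of the degenerate case where both predictions are non-positive are assumptions made for illustration (the tasks in this section have strictly positive counts and distances).

```python
def accuracy_verifier(y: float, y_star: float) -> float:
    """verif(y, y*) = max(0, 1 - |y - y*| / y*), for a positive ground truth y*."""
    if y_star <= 0:
        raise ValueError("ground truth must be positive")
    return max(0.0, 1.0 - abs(y - y_star) / y_star)


def numeric_consistency_verifier(y: float, y_aug: float) -> float:
    """verif_T(y, y') = max(0, 1 - |y - y'| / max(y, y')); symmetric in its arguments."""
    reference = max(y, y_aug)
    if reference <= 0:
        # Assumed convention for the degenerate case: reward only exact agreement.
        return 1.0 if y == y_aug else 0.0
    return max(0.0, 1.0 - abs(y - y_aug) / reference)


if __name__ == "__main__":
    print(accuracy_verifier(3, 4))             # 0.75
    print(accuracy_verifier(9, 4))             # 0.0 (error exceeds the ground truth)
    print(numeric_consistency_verifier(2, 4))  # 0.5
    print(numeric_consistency_verifier(4, 2))  # 0.5 (same score in either order)
```

Rescaling by the larger prediction keeps the consistency score in $[0,1]$ without any reference to a label, which is what allows it to be used in the label-free setting.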

### 5.6 Ablation: Pairing Strategy

#### Minimal pairing outperforms alternatives\.

Figure [8](https://arxiv.org/html/2606.11918#S5.F8) compares minimal OT pairing to random and one-to-all pairing (7B, SUN RGB-D). Minimal pairing wins on both axes: self-task accuracy of 76.8% vs. 74.7% (random) and 75.8% (one-to-all), and cross-task transfer of +11.7pp vs. +8.2pp and +10.3pp, an advantage that holds across all 4 self-task and 12 cross-task settings. The cost is negligible: 23.2s/step vs. 22.9s and 23.1s, under 1.5% overhead. The gain stems from harder negative pairs: matching each completion with its most challenging counterpart makes random agreement unlikely (see [Appendix A](https://arxiv.org/html/2606.11918#A1)).
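
The overhead figure follows from the reported per-step times; as a quick check, the relative cost of minimal pairing over random and one-to-all pairing is:

```latex
\frac{23.2-22.9}{22.9}\approx 1.3\%, \qquad \frac{23.2-23.1}{23.1}\approx 0.4\%
```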

## 6 Conclusion

We introduced consistency verifiers for post\-training VLMs on spatial reasoning without ground\-truth labels\. By exploiting known relationships between answers under geometric and semantic transformations, our approach provides a self\-supervised reward signal that approaches the performance of ground\-truth supervision — establishing the value of RL post\-training for key invariances, equivariances and adherence to general spatial reasoning principles\. Experiments demonstrate consistent benefits and generalization across four tasks \(orientation, depth, size, relative distance\) and two data domains \(indoor, outdoor\)\.

#### Limitations\.

Our consistency rewards compare two predictions at a time\. Extending to richer relational structures—chains of three or more transformations, or compositional reasoning steps—could exploit additional geometric structure\. Beyond spatial reasoning, applying consistency verifiers to other modalities is a natural next step\.

## Impact Statement

This work presents a self\-supervised method for improving spatial reasoning in VLMs by exploiting consistency under geometric transformations\. The approach reduces reliance on labeled data, which may lower annotation costs and broaden access to model improvement\. We do not anticipate specific negative societal consequences beyond those generally associated with advances in machine learning\.

## References

- Anonymous \(2025\)Anonymous\.Be consistent\! enhancing robust visual reasoning in LVLMs with consistency constraints\.ICLR 2026 Conference Submission 6260, 2025\.URL[https://openreview\.net/forum?id=REPLACE\_WITH\_ID](https://openreview.net/forum?id=REPLACE_WITH_ID)\.
- Bahng et al\. \(2025\)Bahng, H\., Chan, C\., Durand, F\., and Isola, P\.Cycle consistency as reward: Learning image\-text alignment without human preferences\.2025\.
- Birkhoff \(1946\)Birkhoff, G\.Tres observaciones sobre el álgebra lineal\.*Universidad Nacional de Tucumán Revista Serie A*, 5:147–151, 1946\.
- Brazil et al\. \(2023\)Brazil, G\., Straub, J\., Ravi, N\., Johnson, J\., and Gkioxari, G\.Omni3D: A large benchmark and model for 3D object detection in the wild\.In*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp\. 13154–13164, 2023\.
- Cai et al\. \(2025a\)Cai, W\., Ponomarenko, Y\., Yuan, J\., Li, X\., Yang, W\., Dong, H\., and Zhao, B\.Spatialbot: Precise spatial understanding with vision language models\.In*2025 IEEE International Conference on Robotics and Automation \(ICRA\)*, pp\. 9490–9498\. IEEE, 2025a\.
- Cai et al\. \(2025b\)Cai, Z\., Wang, Y\., Sun, Q\., Wang, R\., Gu, C\., Yin, W\., Lin, Z\., Yang, Z\., Wei, C\., Shi, X\., Deng, K\., Han, X\., Chen, Z\., Li, J\., Fan, X\., Deng, H\., Lu, L\., Li, B\., Liu, Z\., Wang, Q\., Lin, D\., and Yang, L\.Holistic evaluation of multimodal llms on spatial intelligence\.*arXiv preprint arXiv:2508\.13142*, 2025b\.
- Chen et al\. \(2024a\)Chen, B\., Xu, Z\., Kirmani, S\., Ichter, B\., Driess, D\., Florence, P\., Sadigh, D\., Guibas, L\., and Xia, F\.Spatialvlm: Endowing vision\-language models with spatial reasoning capabilities\.In*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp\. 14455–14465, 2024a\.
- Chen et al\. \(2024b\)Chen, J\. et al\.Sprite: Scaling spatial reasoning in mllms through programmatic data synthesis\.*arXiv preprint arXiv:2512\.16237*, 2024b\.
- Chen et al\. \(2025a\)Chen, W\. et al\.Space\-10: A comprehensive benchmark for multimodal large language models in compositional spatial intelligence\.*arXiv preprint arXiv:2506\.07966*, 2025a\.
- Chen et al\. \(2025b\)Chen, X\., Li, T\., and Zou, D\.On the mechanism of reasoning pattern selection in reinforcement learning for language models\.*arXiv preprint arXiv:2506\.04695*, 2025b\.
- Cheng et al\. \(2024\)Cheng, A\.\-C\., Yin, H\., Fu, Y\., Guo, Q\., Yang, R\., Kautz, J\., Wang, X\., and Molchanov, P\.Spatialrgpt: Grounded spatial reasoning in vision\-language models\.In*Advances in Neural Information Processing Systems*, volume 37, 2024\.
- Dagan et al\. \(2025\)Dagan, G\., Loginova, O\., and Batra, A\.Cast: Cross\-modal alignment similarity test for vision language models\.In*Proceedings of the 31st International Conference on Computational Linguistics*, pp\. 1387–1402, 2025\.
- Danskin \(1966\)Danskin, J\. M\.*The Theory of Max\-Min and its Application to Weapons Allocation Problems*\.Springer, Berlin, 1966\.
- DeepSeek\-AI \(2025\)DeepSeek\-AI\.Deepseek\-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025\.URL[https://arxiv\.org/abs/2501\.12948](https://arxiv.org/abs/2501.12948)\.
- Flamary et al\. \(2021\)Flamary, R\., Courty, N\., Gramfort, A\., Alaya, M\. Z\., Boisbunon, A\., Chambon, S\., Chapel, L\., Corenflos, A\., Fatras, K\., Fournier, N\., Gautheron, L\., Gayraud, N\. T\., Janati, H\., Rakotomamonjy, A\., Redko, I\., Rolet, A\., Schutz, A\., Seguy, V\., Sutherland, D\. J\., Tavenard, R\., Tong, A\., and Vayer, T\.POT: Python Optimal Transport\.*Journal of Machine Learning Research*, 22\(78\):1–8, 2021\.
- Geiger et al\. \(2012\)Geiger, A\., Lenz, P\., and Urtasun, R\.Are we ready for autonomous driving? The KITTI vision benchmark suite\.In*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp\. 3354–3361, 2012\.
- Guo et al\. \(2025a\)Guo, D\., Yang, D\., Zhang, H\., Song, J\., Wang, P\., Zhu, Q\., Xu, R\., Zhang, R\., Ma, S\., Bi, X\., Zhang, X\., Yu, X\., Wu, Y\., Wu, Z\. F\., Gou, Z\., Shao, Z\., Li, Z\., Gao, Z\., Liu, A\., Xue, B\., Wang, B\., Wu, B\., Feng, B\., Lu, C\., Zhao, C\., Deng, C\., Ruan, C\., Dai, D\., Chen, D\., Ji, D\., Li, E\., Lin, F\., Dai, F\., Luo, F\., Hao, G\., Chen, G\., Li, G\., Zhang, H\., Xu, H\., Ding, H\., Gao, H\., Qu, H\., Li, H\., Guo, J\., Li, J\., Chen, J\., Yuan, J\., Tu, J\., Qiu, J\., Li, J\., Cai, J\. L\., Ni, J\., Liang, J\., Chen, J\., Dong, K\., Hu, K\., You, K\., Gao, K\., Guan, K\., Huang, K\., Yu, K\., Wang, L\., Zhang, L\., Zhao, L\., Wang, L\., Zhang, L\., Xu, L\., Xia, L\., Zhang, M\., Zhang, M\., Tang, M\., Zhou, M\., Li, M\., Wang, M\., Li, M\., Tian, N\., Huang, P\., Zhang, P\., Wang, Q\., Chen, Q\., Du, Q\., Ge, R\., Zhang, R\., Pan, R\., Wang, R\., Chen, R\. J\., Jin, R\. L\., Chen, R\., Lu, S\., Zhou, S\., Chen, S\., Ye, S\., Wang, S\., Yu, S\., Zhou, S\., Pan, S\., Li, S\. S\., Zhou, S\., Wu, S\., Yun, T\., Pei, T\., Sun, T\., Wang, T\., Zeng, W\., Liu, W\., Liang, W\., Gao, W\., Yu, W\., Zhang, W\., Xiao, W\. L\., An, W\., Liu, X\., Wang, X\., Chen, X\., Nie, X\., Cheng, X\., Liu, X\., Xie, X\., Liu, X\., Yang, X\., Li, X\., Su, X\., Lin, X\., Li, X\. Q\., Jin, X\., Shen, X\., Chen, X\., Sun, X\., Wang, X\., Song, X\., Zhou, X\., Wang, X\., Shan, X\., Li, Y\. K\., Wang, Y\. Q\., Wei, Y\. X\., Zhang, Y\., Xu, Y\., Li, Y\., Zhao, Y\., Sun, Y\., Wang, Y\., Yu, Y\., Zhang, Y\., Shi, Y\., Xiong, Y\., He, Y\., Piao, Y\., Wang, Y\., Tan, Y\., Ma, Y\., Liu, Y\., Guo, Y\., Ou, Y\., Wang, Y\., Gong, Y\., Zou, Y\., He, Y\., Xiong, Y\., Luo, Y\., You, Y\., Liu, Y\., Zhou, Y\., Zhu, Y\. X\., Huang, Y\., Li, Y\., Zheng, Y\., Zhu, Y\., Ma, Y\., Tang, Y\., Zha, Y\., Yan, Y\., Ren, Z\. 
Z\., Ren, Z\., Sha, Z\., Fu, Z\., Xu, Z\., Xie, Z\., Zhang, Z\., Hao, Z\., Ma, Z\., Yan, Z\., Wu, Z\., Gu, Z\., Zhu, Z\., Liu, Z\., Li, Z\., Xie, Z\., Song, Z\., Pan, Z\., Huang, Z\., Xu, Z\., Zhang, Z\., and Zhang, Z\.Deepseek\-r1 incentivizes reasoning in llms through reinforcement learning\.*Nature*, 645\(8081\):633–638, September 2025a\.ISSN 1476\-4687\.doi:10\.1038/s41586\-025\-09422\-z\.URL[http://dx\.doi\.org/10\.1038/s41586\-025\-09422\-z](http://dx.doi.org/10.1038/s41586-025-09422-z)\.
- Guo et al\. \(2025b\)Guo, X\., Zhou, R\., Wang, Y\., Zhang, Q\., Zhang, C\., Jegelka, S\., Wang, X\., Chai, J\., Yin, G\., Lin, W\., and Wang, Y\.SSL4RL: Revisiting self\-supervised learning as intrinsic reward for visual\-language reasoning, 2025b\.URL[https://arxiv\.org/abs/2510\.16416](https://arxiv.org/abs/2510.16416)\.
- Hu et al\. \(2021\)Hu, E\. J\., Shen, Y\., Wallis, P\., Allen\-Zhu, Z\., Li, Y\., Wang, S\., Wang, L\., and Chen, W\.Lora: Low\-rank adaptation of large language models, 2021\.URL[https://arxiv\.org/abs/2106\.09685](https://arxiv.org/abs/2106.09685)\.
- Jiang et al\. \(2025\)Jiang, Y\., Chai, Y\., Brbić, M\., and Moor, M\.Marble: A hard benchmark for multimodal spatial reasoning and planning\.*arXiv preprint arXiv:2506\.22992*, 2025\.
- Ma et al\. \(2025\)Ma, W\., Chou, Y\.\-C\., Liu, Q\., Wang, X\., de Melo, C\., Xie, J\., and Yuille, A\.SpatialReasoner: Towards explicit and generalizable 3d spatial reasoning, 2025\.URL[https://arxiv\.org/abs/2504\.20024](https://arxiv.org/abs/2504.20024)\.
- Peyré & Cuturi \(2019\)Peyré, G\. and Cuturi, M\.Computational optimal transport: With applications to data science\.*Foundations and Trends in Machine Learning*, 11\(5\-6\):355–607, 2019\.
- Santambrogio \(2015\)Santambrogio, F\.*Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling*\.Birkhäuser, Cham, 2015\.
- Song et al\. \(2015\)Song, S\., Lichtenberg, S\. P\., and Xiao, J\.SUN RGB\-D: A RGB\-D scene understanding benchmark suite\.In*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp\. 567–576, 2015\.
- Stogiannidis et al\. \(2025\)Stogiannidis, I\., McDonagh, S\., and Tsaftaris, S\. A\.Mind the gap: Benchmarking spatial reasoning in vision\-language models\.*arXiv preprint arXiv:2503\.19707*, 2025\.
- Thrush et al\. \(2022\)Thrush, T\., Jiang, R\., Bartolo, M\., Singh, A\., Williams, A\., Kiela, D\., and Ross, C\.Winoground: Probing vision and language models for visio\-linguistic compositionality\.In*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp\. 5238–5248, 2022\.
- Wang et al\. \(2023\)Wang, T\., Lin, K\., Li, L\., Lin, C\.\-C\., Yang, Z\., Zhang, H\., Liu, Z\., and Wang, L\.Equivariant similarity for vision\-language foundation models\.In*Proceedings of the IEEE/CVF International Conference on Computer Vision*, pp\. 11998–12008, 2023\.
- Wang et al\. \(2019\)Wang, X\., Jabri, A\., and Efros, A\. A\.Learning correspondence from the cycle\-consistency of time\.In*Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, pp\. 2566–2576, 2019\.
- Wang et al\. \(2022\)Wang, X\., Wei, J\., Schuurmans, D\., Le, Q\., Chi, E\., Narang, S\., Chowdhery, A\., and Zhou, D\.Self\-consistency improves chain of thought reasoning in language models\.*arXiv preprint arXiv:2203\.11171*, 2022\.
- Wei et al\. \(2023\)Wei, J\., Wang, X\., Schuurmans, D\., Bosma, M\., Ichter, B\., Xia, F\., Chi, E\., Le, Q\., and Zhou, D\.Chain\-of\-thought prompting elicits reasoning in large language models, 2023\.URL[https://arxiv\.org/abs/2201\.11903](https://arxiv.org/abs/2201.11903)\.
- Wu et al\. \(2025\)Wu, P\., Zhang, Y\., Diao, H\., Li, B\., Lu, L\., and Liu, Z\.Visual jigsaw post\-training improves mllms, 2025\.URL[https://arxiv\.org/abs/2509\.25190](https://arxiv.org/abs/2509.25190)\.
- Yu et al\. \(2025\)Yu, S\., Chen, Y\., Ju, H\., Jia, L\., Zhang, F\., Huang, S\., Wu, Y\., Cui, R\., Ran, B\., Zhang, Z\., et al\.How far are VLMs from visual spatial intelligence? A benchmark\-driven perspective\.*arXiv preprint arXiv:2509\.18905*, 2025\.
- Yue et al\. \(2025\)Yue, Y\., Chen, Z\., Lu, R\., Zhao, A\., Wang, Z\., Song, S\., and Huang, G\.Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?*arXiv preprint arXiv:2504\.13837*, 2025\.
- Zhang et al\. \(2024\)Zhang, Y\., Xiao, F\., Huang, T\., Fan, C\.\-K\., Dong, H\., Li, J\., Wang, J\., Cheng, K\., Zhang, S\., and Guo, H\.Unveiling the tapestry of consistency in large vision\-language models\.*Advances in Neural Information Processing Systems*, 37:118632–118653, 2024\.
- Zhang et al\. \(2025\)Zhang, Z\., Zhu, J\., Ge, X\., Zhao, Z\., Zhou, Z\., Li, X\., Feng, X\., Yao, J\., and Han, B\.Co\-rewarding: Stable self\-supervised rl for eliciting reasoning in large language models\.*arXiv preprint arXiv:2508\.00410*, 2025\.
- Zhao et al\. \(2025\)Zhao, X\., Kang, Z\., Feng, A\., Levine, S\., and Song, D\.Learning to reason without external rewards, 2025\.URL[https://arxiv\.org/abs/2505\.19590](https://arxiv.org/abs/2505.19590)\.
- Zhou et al\. \(2016\)Zhou, T\., Krahenbuhl, P\., Aubry, M\., Huang, Q\., and Efros, A\. A\.Learning dense correspondence via 3d\-guided cycle consistency\.In*Proceedings of the IEEE conference on computer vision and pattern recognition*, pp\. 117–126, 2016\.
- Zhu et al\. \(2017\)Zhu, J\.\-Y\., Park, T\., Isola, P\., and Efros, A\. A\.Unpaired image\-to\-image translation using cycle\-consistent adversarial networks\.In*Proceedings of the IEEE international conference on computer vision*, pp\. 2223–2232, 2017\.
- Zuo et al\. \(2025\)Zuo, Y\., Zhang, K\., Sheng, L\., Qu, S\., Cui, G\., Zhu, X\., Li, H\., Zhang, Y\., Long, X\., Hua, E\., Qi, B\., Sun, Y\., Ma, Z\., Yuan, L\., Ding, N\., and Zhou, B\.Ttrl: Test\-time reinforcement learning, 2025\.URL[https://arxiv\.org/abs/2504\.16084](https://arxiv.org/abs/2504.16084)\.

## Appendix A Further Analysis of the Minimal Consistency Pairing

### A.1 Consistency Pairing Under a Random Baseline

This appendix extends the analysis of pairing strategies introduced in [Section 4](https://arxiv.org/html/2606.11918#S4). We investigate how different strategies behave under an *uninformative* model, one that guesses uniformly at random, ignoring the transformation relationship between prompts. By deriving the expected per-completion reward $\mathbb{E}[r_{i}]$ under this random baseline, we compare how permissive each strategy is: high reward under random guessing indicates susceptibility to reward hacking, while low reward indicates a stronger learning signal.

#### Setup and Notation\.

Let $x$ be an original prompt and $x'=T(x)$ an augmented version obtained by applying a transformation $T$. We sample $K$ binary completions from each prompt:

$$
y_{1},\ldots,y_{K}\sim\pi_{\theta}(\cdot\mid x),\qquad y'_{1},\ldots,y'_{K}\sim\pi_{\theta}(\cdot\mid x'),
$$

where each completion $y_{i},y'_{j}\in\{\text{True},\text{False}\}$.

As described in [Section 3](https://arxiv.org/html/2606.11918#S3), each transformation $T$ induces a known mapping $\phi_{T}$ on the answer space: if $y^{\star}$ is the correct answer to $x$, then $\phi_{T}(y^{\star})$ is the correct answer to $x'$. For binary tasks, $\phi_{T}$ is either the identity (invariant transformations, where answers should match) or negation (equivariant transformations, where answers should differ). The consistency verifier rewards completions that satisfy the expected relationship:

verifT\(y,y′\)=𝟏\{y′=ϕT\(y\)\}=\{𝟏\{y=y′\}ifϕT=id\(invariant\),𝟏\{y≠y′\}ifϕT=¬\(equivariant\)\.\\text\{verif\}\_\{T\}\(y,y^\{\\prime\}\)=\\mathbf\{1\}\\\{y^\{\\prime\}=\\phi\_\{T\}\(y\)\\\}=\\begin\{cases\}\\mathbf\{1\}\\\{y=y^\{\\prime\}\\\}&\\text\{if \}\\phi\_\{T\}=\\mathrm\{id\}\\text\{ \(invariant\)\},\\\\ \\mathbf\{1\}\\\{y\\neq y^\{\\prime\}\\\}&\\text\{if \}\\phi\_\{T\}=\\neg\\text\{ \(equivariant\)\}\.\\end\{cases\}
For the analysis below, we consider the equivariant caseverifT\(y,y′\)=𝟏\{y≠y′\}\\text\{verif\}\_\{T\}\(y,y^\{\\prime\}\)=\\mathbf\{1\}\\\{y\\neq y^\{\\prime\}\\\}without loss of generality; the invariant case is symmetric sinceℙ\(y=y′\)=ℙ\(y≠y′\)=1/2\\mathbb\{P\}\(y=y^\{\\prime\}\)=\\mathbb\{P\}\(y\\neq y^\{\\prime\}\)=1/2under independent unbiased bits\.

#### Pairing Strategies\.

Given $K$ completions from each prompt, we have a $K \times K$ matrix of verifier scores $V_{ij} = \text{verif}_T(y_i, y'_j)$. Different strategies aggregate this matrix into per-completion rewards $r_i$.

#### Random Pairing\.

Pair completions by generation order: the $i$-th completion from $x$ with the $i$-th completion from $x'$. The per-completion reward is:

$$r_i^{\mathrm{rand}} = \text{verif}_T(y_i, y'_i).$$

This strategy is simple but arbitrary: the reward depends on the sampling order rather than the content of the completions.

#### One\-to\-All\.

Compare each completion from $x$ to every completion from $x'$ and average:

$$r_i^{\mathrm{all}} = \frac{1}{K} \sum_{j=1}^{K} \text{verif}_T(y_i, y'_j).$$

Averaging over all pairs reduces variance but can dilute the signal: a few strong inconsistencies may be hidden when most pairs happen to be consistent.

#### Minimal Consistency \(OT\-GRPO\)\.

Find the permutation $\sigma^\star \in \mathcal{S}_K$ (the symmetric group on $K$ elements) that *minimizes* the total consistency score:

$$\sigma^\star \in \operatorname*{arg\,min}_{\sigma \in \mathcal{S}_K} \sum_{i=1}^{K} \text{verif}_T(y_i, y'_{\sigma(i)}), \quad r_i^{\mathrm{OT}} = \text{verif}_T(y_i, y'_{\sigma^\star(i)}).$$

By construction, minimal consistency pairs each completion with its most challenging counterpart.
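The three pairing strategies above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, answers are represented as Python booleans, and the minimal-consistency matching is found by brute force over all $K!$ permutations, which is only practical for small $K$ (a real implementation would use an assignment or OT solver).

```python
from itertools import permutations


def verif(y, y_prime, equivariant=True):
    """Consistency verifier: 1 if the pair satisfies the expected relation."""
    return int(y != y_prime) if equivariant else int(y == y_prime)


def random_pairing(ys, ys_prime, equivariant=True):
    # Pair the i-th completion of x with the i-th completion of x'.
    return [verif(y, yp, equivariant) for y, yp in zip(ys, ys_prime)]


def one_to_all(ys, ys_prime, equivariant=True):
    # Average the verifier over every completion of x'.
    k = len(ys_prime)
    return [sum(verif(y, yp, equivariant) for yp in ys_prime) / k for y in ys]


def minimal_consistency(ys, ys_prime, equivariant=True):
    # Brute-force search for the permutation with the lowest total score.
    k = len(ys)
    best = min(
        permutations(range(k)),
        key=lambda s: sum(verif(ys[i], ys_prime[s[i]], equivariant) for i in range(k)),
    )
    return [verif(ys[i], ys_prime[best[i]], equivariant) for i in range(k)]


ys = [True, True, True, False]          # completions for x
ys_prime = [True, False, False, False]  # completions for x'
print(random_pairing(ys, ys_prime))
print(one_to_all(ys, ys_prime))
print(minimal_consistency(ys, ys_prime))
```

Which individual completions receive the nonzero rewards under minimal consistency depends on how ties between equally good permutations are broken; only the total is determined by the data.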

#### Random Baseline\.

To calibrate these rewards, we evaluate their expected value when the model behaves as a random guesser. Identifying True $\mapsto 1$ and False $\mapsto 0$, we assume completions are independent unbiased bits:

$$y_i \stackrel{\mathrm{iid}}{\sim} \mathrm{Bernoulli}(1/2), \qquad y'_j \stackrel{\mathrm{iid}}{\sim} \mathrm{Bernoulli}(1/2),$$

with independence across the two groups.

Proposition A.1: Expected Reward Under Random Baseline. Under the random baseline, the expected per-completion rewards for the three pairing strategies are: (i) Random pairing: $\mathbb{E}[r_i^{\mathrm{rand}}] = \frac{1}{2}$, (ii) One-to-all: $\mathbb{E}[r_i^{\mathrm{all}}] = \frac{1}{2}$, (iii) Minimal consistency: $\mathbb{E}[r_i^{\mathrm{OT}}] \approx \frac{1}{\sqrt{\pi K}}$. In particular, only minimal consistency penalizes random guessing, with expected reward vanishing as $O(1/\sqrt{K})$.

###### Proof\.

(i) Random pairing. For each $i$, the pair $(y_i, y'_i)$ consists of two independent unbiased bits. Hence, $\mathbb{P}(y_i \neq y'_i) = \frac{1}{2}$. Since this holds for each $i$, we have $\mathbb{E}[r_i^{\mathrm{rand}}] = \frac{1}{2}$.

(ii) One-to-all. The same argument applies to any pair $(y_i, y'_j)$: since all completions are independent, $\mathbb{P}(y_i \neq y'_j) = \frac{1}{2}$ for all $i, j$. Therefore $\mathbb{E}[r_i^{\mathrm{all}}] = \frac{1}{2}$.

(iii) Minimal consistency. This case is more subtle. Define the counts:

$$S = \sum_{i=1}^{K} y_i, \qquad S' = \sum_{j=1}^{K} y'_j,$$

so that $S$ and $S'$ are the numbers of True answers in each group. Under the random baseline, $S, S' \sim \mathrm{Bin}(K, 1/2)$ independently.

To minimize the number of disagreeing pairs, the optimal strategy matches identical bits whenever possible. Suppose $S \geq S'$ (the case $S' > S$ is symmetric). We match all $S'$ True values from the augmented group with True values from the original group (contributing 0 to the verifier), and likewise all $K - S$ False values from the original group with False values from the augmented group. The remaining $S - S'$ True values must pair with False values, creating exactly $|S - S'|$ disagreements:

$$\min_{\sigma \in \mathcal{S}_K} \sum_{i=1}^{K} \mathbf{1}\{y_i \neq y'_{\sigma(i)}\} = |S - S'|.$$

The average reward is therefore $\frac{1}{K} \sum_i r_i^{\mathrm{OT}} = |S - S'| / K$. To compute its expectation, we analyze $|S - S'|$. Write the difference as a sum of centered terms:

$$S - S' = \sum_{i=1}^{K} \Big(y_i - \tfrac{1}{2}\Big) - \sum_{j=1}^{K} \Big(y'_j - \tfrac{1}{2}\Big),$$

where each summand $(y_i - \frac{1}{2})$ or $(y'_j - \frac{1}{2})$ equals $\pm\frac{1}{2}$ with equal probability. This is a sum of $2K$ independent, mean-zero, bounded random variables. As $S, S' \sim \mathrm{Bin}(K, 1/2)$ independently, the variance is:

$$\mathrm{Var}(S - S') = K \cdot \frac{1}{4} + K \cdot \frac{1}{4} = \frac{K}{2}.$$

By the central limit theorem:

$$\frac{S - S'}{\sqrt{K/2}} \xrightarrow{d} \mathcal{N}(0, 1).$$

For a random variable $Z \sim \mathcal{N}(0, \sigma^2)$, we have $\mathbb{E}|Z| = \sigma\sqrt{2/\pi}$. Taking $\sigma = \sqrt{K/2}$ gives:

$$\mathbb{E}|S - S'| \approx \sqrt{\frac{K}{2}} \cdot \sqrt{\frac{2}{\pi}} = \sqrt{\frac{K}{\pi}}.$$

By symmetry across completions, each $r_i^{\mathrm{OT}}$ has the same expectation as the group average. Therefore:

$$\mathbb{E}[r_i^{\mathrm{OT}}] = \frac{\mathbb{E}|S - S'|}{K} \approx \frac{1}{\sqrt{\pi K}}.$$

∎

Table [2](https://arxiv.org/html/2606.11918#A1.T2) summarizes these results, and Figure [9](https://arxiv.org/html/2606.11918#A1.F9) visualizes the comparison. For typical group sizes ($K = 8$ to $32$), minimal consistency yields expected rewards from about $0.20$ (at $K = 8$) down to about $0.10$ (at $K = 32$), well below the $0.50$ baseline of the other strategies.
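The normal approximation in the proof can be checked against the exact expectation. The sketch below assumes only the identity $\mathbb{E}[r_i^{\mathrm{OT}}] = \mathbb{E}|S - S'| / K$ from the proof and sums over the joint distribution of two independent $\mathrm{Bin}(K, 1/2)$ counts; the function name is hypothetical.

```python
from math import comb, pi, sqrt


def expected_ot_reward(k):
    """Exact E|S - S'| / K for independent S, S' ~ Bin(k, 1/2)."""
    total = 0
    for s in range(k + 1):
        for s_prime in range(k + 1):
            # comb(k, s) * comb(k, s_prime) / 4**k is the joint probability.
            total += comb(k, s) * comb(k, s_prime) * abs(s - s_prime)
    return total / 4**k / k


for k in (8, 16, 32):
    exact = expected_ot_reward(k)
    approx = 1 / sqrt(pi * k)
    print(f"K={k:2d}  exact={exact:.4f}  approx={approx:.4f}")
```

For the group sizes quoted above, the exact values and the $1/\sqrt{\pi K}$ approximation agree to within about $0.01$.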

Table 2: Expected per-completion reward under the random baseline (uninformative model).

![Refer to caption](https://arxiv.org/html/2606.11918v1/x9.png)

Figure 9: Expected per-completion reward under the random baseline as a function of group size $K$. Random pairing and one-to-all remain at $1/2$ regardless of $K$, while minimal consistency decays as $1/\sqrt{\pi K}$.

#### Interpretation.

Only minimal consistency penalizes uninformative models: as $K$ grows, the expected reward for a random guesser vanishes as $O(1/\sqrt{K})$. The key insight is that minimal consistency actively searches for the *worst-case* pairing. A random model produces roughly equal numbers of True and False in each group, so the optimal matching pairs most completions with identical answers (which fail the verifier). Only the “leftover” completions (the imbalance $|S - S'| \sim \sqrt{K}$) contribute to the reward. In contrast, random pairing and one-to-all treat all pairs equally, allowing a random model to benefit from the $50\%$ chance that any given pair happens to satisfy the verifier.

#### Conclusion\.

Minimal consistency makes the consistency verification task intrinsically harder and the reward signal less susceptible to random guessing\. This property is particularly valuable in our label\-free setting, where we cannot rely on ground\-truth accuracy to filter out uninformative behavior\. By using minimal consistency, we ensure that high rewards reflect genuine cross\-prompt agreement rather than statistical coincidence\.

### A.2 Wasserstein Reformulation

Beyond the random-baseline analysis, minimal consistency admits an interpretation through optimal transport theory. We show that maximizing the minimal-consistency reward is equivalent to minimizing a Wasserstein distance between the model’s output distributions on $x$ and $x'$, with transport cost given by the negative verifier $c_T = -\text{verif}_T$.

Definition A.2: Wasserstein Distance. Given two probability measures $\pi$ and $\pi'$ on a space $\mathcal{Y}$, and a cost function $c: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, the *Wasserstein distance* (or optimal transport cost) is:

$$W_c(\pi, \pi') \coloneqq \min_{\gamma \in \Gamma(\pi, \pi')} \mathbb{E}_{(y, y') \sim \gamma}\big[c(y, y')\big],$$

where $\Gamma(\pi, \pi')$ denotes the set of couplings, i.e., joint distributions on $\mathcal{Y} \times \mathcal{Y}$ with marginals $\pi$ and $\pi'$.

Proposition A.3: Wasserstein Reformulation of OT-GRPO. Let $\pi_\theta^{x} \coloneqq \pi_\theta(\cdot \mid x)$ and $\pi_\theta^{x'} \coloneqq \pi_\theta(\cdot \mid x')$ denote the model’s output distributions, and define the inconsistency cost $c_T \coloneqq -\text{verif}_T$. Then the OT-GRPO objective is equivalent to minimizing the expected Wasserstein distance:

$$\min_\theta \; \mathbb{E}_{\begin{subarray}{c} x \sim \mathcal{D},\, T \sim \mathcal{T} \\ x' = T(x) \end{subarray}} \; \big[W_{c_T}(\pi_\theta^{x}, \pi_\theta^{x'})\big].$$

###### Proof\.

The initial pairwise consistency objective from [Section 4](https://arxiv.org/html/2606.11918#S4) is:

$$\max_\theta \; \mathbb{E}_{\begin{subarray}{c} x \sim \mathcal{D},\, T \sim \mathcal{T} \\ x' = T(x) \end{subarray}} \; \mathbb{E}_{\begin{subarray}{c} y \sim \pi_\theta(\cdot \mid x) \\ y' \sim \pi_\theta(\cdot \mid x') \end{subarray}} \big[\text{verif}_T(y, y')\big].$$

In this formulation, $y$ and $y'$ are sampled independently. With minimal consistency pairing, however, completions are matched via the optimal transport coupling $\gamma^\star$ that minimizes expected consistency:

$$\gamma^\star = \operatorname*{arg\,min}_{\gamma \in \Gamma(\pi_\theta^{x}, \pi_\theta^{x'})} \mathbb{E}_{(y, y') \sim \gamma}\big[\text{verif}_T(y, y')\big].$$

Substituting this adversarial coupling yields a max-min objective:

$$\max_\theta \; \mathbb{E}_{\begin{subarray}{c} x \sim \mathcal{D},\, T \sim \mathcal{T} \\ x' = T(x) \end{subarray}} \; \min_{\gamma \in \Gamma(\pi_\theta^{x}, \pi_\theta^{x'})} \mathbb{E}_{(y, y') \sim \gamma}\big[\text{verif}_T(y, y')\big].$$

Recognizing the inner minimization as the Wasserstein distance $W_{\text{verif}_T}(\pi_\theta^{x}, \pi_\theta^{x'})$:

$$\max_\theta \; \mathbb{E}_{\begin{subarray}{c} x \sim \mathcal{D},\, T \sim \mathcal{T} \\ x' = T(x) \end{subarray}} \; W_{\text{verif}_T}(\pi_\theta^{x}, \pi_\theta^{x'}).$$

Converting maximization to minimization by negating:

$$\min_\theta \; \mathbb{E}_{\begin{subarray}{c} x \sim \mathcal{D},\, T \sim \mathcal{T} \\ x' = T(x) \end{subarray}} \; \big(-W_{\text{verif}_T}(\pi_\theta^{x}, \pi_\theta^{x'})\big).$$

Finally, absorbing the negative into the cost gives $-W_{\text{verif}_T} = W_{-\text{verif}_T} = W_{c_T}$, completing the proof. ∎

#### Interpretation\.

This reformulation reveals that OT-GRPO seeks to *align* the model’s conditional distributions $\pi_\theta^{x}$ and $\pi_\theta^{x'}$ in the geometry defined by the Wasserstein distance with inconsistency cost $c_T = -\text{verif}_T$.
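For binary answers, the inner minimization over couplings has a closed form that makes this interpretation concrete. The worked case below is an illustration rather than part of the original derivation; it assumes the shorthand $p$ and $q$ (not used elsewhere in the paper) for the probabilities of answering True on $x$ and $x'$, and uses the standard fact that the smallest disagreement probability over couplings of two Bernoulli distributions is their total variation distance.

```latex
% Assumed shorthand: p = \pi_\theta^{x}(\text{True}), q = \pi_\theta^{x'}(\text{True}).
% Equivariant verifier, \text{verif}_T(y, y') = \mathbf{1}\{y \neq y'\}:
\min_{\gamma \in \Gamma(\pi_\theta^{x}, \pi_\theta^{x'})}
  \mathbb{E}_{(y, y') \sim \gamma}\big[\mathbf{1}\{y \neq y'\}\big] = |p - q|,
% Invariant verifier, \text{verif}_T(y, y') = \mathbf{1}\{y = y'\}:
\min_{\gamma \in \Gamma(\pi_\theta^{x}, \pi_\theta^{x'})}
  \mathbb{E}_{(y, y') \sim \gamma}\big[\mathbf{1}\{y = y'\}\big] = |p + q - 1|.
```

The first expression is the population counterpart of $|S - S'| / K$ from Appendix A.1 (with $p \approx S/K$ and $q \approx S'/K$). It reaches its maximum of 1 only when the model answers True with certainty on one prompt and False with certainty on the other; the second reaches 1 only when $p = q \in \{0, 1\}$.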

## Appendix B Implementation Details

### B.1 System Prompt

As described in [Section 4](https://arxiv.org/html/2606.11918#S4), we prompt the model with a system instruction adapted from DeepSeek-R1 (DeepSeek-AI, [2025](https://arxiv.org/html/2606.11918#bib.bib14)) to encourage chain-of-thought reasoning. The exact system prompt is:

> “A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think\> </think\> and <answer\> </answer\> tags, respectively, i.e., <think\> reasoning process here </think\> <answer\> answer here </answer\>.”

The format reward is 1 if the completion contains both a valid <think\>...</think\> block and a valid <answer\>...</answer\> block with the answer being either “True” or “False”. If either component is missing or malformed, the format reward is 0.
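A format check of this kind can be sketched with a regular expression. The pattern below is an assumption rather than the authors' exact rule: it requires a single reasoning block followed by a single answer block and nothing else, which is stricter than "contains both blocks".

```python
import re

# Hypothetical pattern: a <think> block, then an <answer> block whose content
# is exactly "True" or "False" (surrounding whitespace is tolerated).
_FORMAT = re.compile(
    r"\s*<think>.*?</think>\s*<answer>\s*(True|False)\s*</answer>\s*",
    re.DOTALL,
)


def format_reward(completion):
    """Return 1 for a well-formed completion and 0 otherwise."""
    return 1 if _FORMAT.fullmatch(completion) else 0


print(format_reward("<think>The table is nearer.</think><answer>True</answer>"))
print(format_reward("<think>Not sure.</think><answer>Maybe</answer>"))
print(format_reward("<answer>False</answer>"))
```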

### B.2 Handling Unparseable Answers

During training, some completions may fail to produce a valid “True” or “False” answer (e.g., due to malformed output or refusal to answer). We handle these *unparseable* completions as follows:

- Unparseable completions are excluded from the OT matching. The optimal transport problem is solved only over completions with valid parsed answers.
- Unparseable completions receive a consistency reward of 0, which, after group normalization, translates to a negative advantage whenever the group's mean reward is positive, discouraging the model from producing unparseable outputs.

This design ensures that the OT matching remains well-defined even when some completions are invalid, while providing a learning signal that encourages properly formatted outputs. The formal treatment of the matching problem, including how we handle unequal group sizes, is given in [Section B.3](https://arxiv.org/html/2606.11918#A2.SS3).
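The two rules above can be sketched as follows. The helper names are hypothetical, the matching step itself is omitted, and the normalization shown (subtract the group mean, divide by the group standard deviation) is the usual GRPO-style form, assumed here rather than taken from the paper.

```python
import re
from statistics import mean, pstdev

_ANSWER = re.compile(r"<answer>\s*(True|False)\s*</answer>")


def parse_answer(completion):
    """Return True/False for a parseable completion, else None."""
    match = _ANSWER.search(completion)
    return None if match is None else match.group(1) == "True"


def split_valid(completions):
    """Indices of parseable completions and their parsed answers."""
    parsed = [parse_answer(c) for c in completions]
    valid = [i for i, answer in enumerate(parsed) if answer is not None]
    return valid, [parsed[i] for i in valid]


def group_advantages(rewards, eps=1e-6):
    # Assumed GRPO-style normalization within one group of completions.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


completions = [
    "<think>...</think><answer>True</answer>",
    "I cannot answer that.",
    "<think>...</think><answer>False</answer>",
]
valid, answers = split_valid(completions)

# Suppose the OT matching (not shown) gave reward 1 to both valid completions;
# the unparseable completion keeps its default reward of 0.
rewards = [0.0] * len(completions)
for i in valid:
    rewards[i] = 1.0
advantages = group_advantages(rewards)
print(valid, answers, advantages)
```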

### B.3 Discrete Optimal Transport for Completion Matching

This section details how we solve the minimal consistency matching in practice, particularly when unparseable completions lead to unequal group sizes. As discussed in [Section B.2](https://arxiv.org/html/2606.11918#A2.SS2), unparseable completions are excluded, leaving $n_0$ valid completions $\{y_1, \ldots, y_{n_0}\}$ from the original prompt and $n_1$ valid completions $\{y'_1, \ldots, y'_{n_1}\}$ from the augmented prompt.

Definition B.1: Discrete Optimal Transport for Completion Matching. Given $n_0$ valid completions $\{y_1, \ldots, y_{n_0}\}$ from the original prompt and $n_1$ valid completions $\{y'_1, \ldots, y'_{n_1}\}$ from the augmented prompt, we seek a coupling matrix $\gamma \in \mathbb{R}^{n_0 \times n_1}$ that minimizes the total consistency cost:

$$\gamma^\star = \operatorname*{arg\,min}_{\gamma \geq 0} \sum_{i=1}^{n_0} \sum_{j=1}^{n_1} \gamma_{ij} \cdot \text{verif}_T(y_i, y'_j) \quad \text{s.t.} \quad \sum_{j=1}^{n_1} \gamma_{ij} = \frac{1}{n_0}, \quad \sum_{i=1}^{n_0} \gamma_{ij} = \frac{1}{n_1}.$$

The constraints ensure mass conservation, namely, that each original completion has total outgoing mass $1/n_0$ and each augmented completion has total incoming mass $1/n_1$.

#### Equal group sizes ($n_0 = n_1 = K$).

When all completions are parseable, the feasible set is the *Birkhoff polytope*, the set of doubly stochastic matrices (scaled by $1/K$). By the Birkhoff–von Neumann theorem (Birkhoff, [1946](https://arxiv.org/html/2606.11918#bib.bib3)), its vertices are precisely the permutation matrices. Since we minimize a linear cost, the optimum $\gamma^\star$ is attained at a vertex, corresponding to a permutation $\sigma^\star \in \mathcal{S}_K$. This recovers the formulation in [Section 4](https://arxiv.org/html/2606.11918#S4).

#### Unequal group sizes ($n_0 \neq n_1$).

When some completions are unparseable (by not following the desired format), the marginal constraints differ and the feasible set is no longer the Birkhoff polytope. The optimal coupling $\gamma^\star$ may assign fractional mass from one completion to multiple partners.

#### Extracting deterministic assignments\.

To obtain per-completion rewards, we extract a deterministic assignment from $\gamma^\star$ via argmax:

$$\text{For } y_i: \quad j^\star(i) = \operatorname*{arg\,max}_{j \in \{1, \ldots, n_1\}} \gamma^\star_{ij}, \qquad \text{For } y'_j: \quad i^\star(j) = \operatorname*{arg\,max}_{i \in \{1, \ldots, n_0\}} \gamma^\star_{ij}.$$

The reward for $y_i$ is $\text{verif}_T(y_i, y'_{j^\star(i)})$, and similarly for $y'_j$. When $n_0 = n_1$, the optimal coupling is a permutation and argmax recovers the unique partner. When $n_0 \neq n_1$, argmax selects the partner receiving the largest transport mass.

This formulation gracefully handles missing data: with all completions parseable, we recover the permutation matching; with some unparseable, we solve the OT problem over groups of unequal size (the two marginals still each carry unit total mass) and extract assignments accordingly.
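The argmax extraction step can be illustrated on a small hand-built coupling. This sketch does not solve the transport problem; it takes one optimal coupling as given for $n_0 = 2$, $n_1 = 3$ in the equivariant case, and the tie-breaking rule (first index wins) is an assumption.

```python
from fractions import Fraction as Fr


def verif(y, y_prime):
    # Equivariant verifier: answers are expected to differ.
    return int(y != y_prime)


def extract_partners(gamma):
    """Argmax partner for each row (original) and each column (augmented)."""
    n0, n1 = len(gamma), len(gamma[0])
    j_star = [max(range(n1), key=lambda j: gamma[i][j]) for i in range(n0)]
    i_star = [max(range(n0), key=lambda i: gamma[i][j]) for j in range(n1)]
    return j_star, i_star


ys = [True, True]                # n0 = 2 valid completions for x
ys_prime = [True, False, False]  # n1 = 3 valid completions for x'

# A feasible coupling: rows sum to 1/2 and columns sum to 1/3. Both rows have
# the same cost vector here, so every feasible coupling is optimal.
gamma = [
    [Fr(1, 3), Fr(1, 6), Fr(0)],
    [Fr(0), Fr(1, 6), Fr(1, 3)],
]

j_star, i_star = extract_partners(gamma)
rewards_x = [verif(ys[i], ys_prime[j_star[i]]) for i in range(len(ys))]
rewards_x_prime = [verif(ys[i_star[j]], ys_prime[j]) for j in range(len(ys_prime))]
print(j_star, i_star, rewards_x, rewards_x_prime)
```

Because several couplings can be optimal, the extracted partners (and hence individual rewards) may depend on which optimum the solver returns.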

## Appendix C Qualitative Example

Figure [10](https://arxiv.org/html/2606.11918#A3.F10) shows the model’s reasoning before and after consistency training on a depth comparison task from SUN RGB-D. Before training, the model relies on a flawed heuristic (vertical position in the image) and produces an incorrect answer. After training, the model correctly reasons about 3D spatial relationships and arrives at the correct answer.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x10.png)

Figure 10: Qualitative example showing how consistency training improves spatial reasoning. Given a depth comparison question from SUN RGB-D, the Qwen2.5-VL-7B model initially produces an incorrect answer based on flawed heuristics (left), while after training, the model reasons correctly about 3D spatial relationships and arrives at the correct answer (right).

## Appendix D Dataset Construction

Spatial reasoning tasks require unambiguous ground\-truth answers to train and evaluate models effectively\. We construct our datasets from images with 3D object annotations, applying careful filtering to select object pairs with clear spatial relationships\. This section describes the data sources, task definitions, selection criteria, and transformations used to generate training examples\.

### D.1 Source Data

#### Omni3D Annotations\.

Spatial reasoning tasks require ground-truth 3D information to establish unambiguous answers. We use annotations from Omni3D (Brazil et al., [2023](https://arxiv.org/html/2606.11918#bib.bib4)), a large-scale benchmark that provides unified 3D bounding boxes across multiple RGB datasets. Each annotation includes a 9-DoF 3D bounding box (position, orientation, dimensions) in camera coordinates, along with 2D projections and visibility estimates. All 3D quantities are expressed in camera coordinates, where the x-axis points right, the y-axis points down, and the z-axis points forward into the scene. Table [3](https://arxiv.org/html/2606.11918#A4.T3) compares the two source datasets.

Table 3: Comparison of source datasets. KITTI provides outdoor driving scenes with large depth ranges; SUN RGB-D provides cluttered indoor scenes with diverse object arrangements.

#### KITTI (Outdoor).

We use KITTI (Geiger et al., [2012](https://arxiv.org/html/2606.11918#bib.bib16)), an autonomous driving benchmark with forward-facing stereo images captured from a moving vehicle. Objects appear at depths from 5 to 100 meters, and the clear lane structure provides unambiguous left/right relationships.

#### SUN RGB\-D \(Indoor\)\.

We use SUN RGB-D (Song et al., [2015](https://arxiv.org/html/2606.11918#bib.bib24)), a collection of RGB-D data from diverse indoor environments. Objects appear at closer range (0.5–10 m), scenes are more cluttered with frequent occlusions, and spatial arrangements are more varied. The contrast between domains allows us to evaluate cross-domain generalization.

### D.2 Task Definitions

We define four spatial reasoning tasks, each formulated as a binary True/False question\. All tasks follow the same structure: given an image with highlighted objects, the model must determine whether a stated spatial relationship holds\. The tasks are designed to cover different aspects of 3D spatial understanding: horizontal position \(orientation\), camera\-relative depth, inter\-object distance, and physical size\.

Table 4: Summary of spatial reasoning tasks. Each task is a binary True/False question about the spatial relationship between highlighted objects.

Table 5: Dataset statistics. Number of training examples per task and source dataset, after filtering. Tasks may share images, so counts should not be summed across tasks.

#### Orientation.

Ground truth is determined by comparing the x\-coordinates of 2D bounding box centers\. We use colored dots as markers to avoid revealing positional information through bounding box placement\.

#### Depth\.

Ground truth is the minimum z\-coordinate across all eight 3D bounding box corners \(using the minimum rather than the center handles objects extending toward the camera\)\. We use bounding boxes as markers, since 2D box size correlates only weakly with depth\.

#### Size\.

Ground truth is 3D volume (width $\times$ height $\times$ length). We use dots as markers to avoid revealing size through bounding box dimensions.

#### Relative Distance\.

This task involves three objects: an anchor and two comparison objects\. Ground truth is the Euclidean distance between object centers\. We select the anchor to maximize the distance gap between comparison objects\.
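The four ground-truth rules above can be written down directly. The sketch below uses a hypothetical `Obj` record and, for the toy data only, axis-aligned boxes built by a hypothetical `axis_aligned_corners` helper; real Omni3D boxes carry an orientation, so their corners would come from the annotation itself.

```python
from dataclasses import dataclass
from itertools import product
from math import dist


@dataclass
class Obj:
    box2d: tuple     # (x_min, y_min, x_max, y_max) in pixels
    center3d: tuple  # (x, y, z) in camera coordinates, metres
    dims: tuple      # (width, height, length) in metres
    corners3d: list  # eight (x, y, z) corners in camera coordinates


def axis_aligned_corners(center, dims):
    # Toy helper: ignores box orientation (assumption for this example only).
    (cx, cy, cz), (w, h, l) = center, dims
    return [(cx + sx * w / 2, cy + sy * h / 2, cz + sz * l / 2)
            for sx, sy, sz in product((-1, 1), repeat=3)]


def make_obj(box2d, center3d, dims):
    return Obj(box2d, center3d, dims, axis_aligned_corners(center3d, dims))


def is_left_of(a, b):
    # Orientation: compare x-coordinates of the 2D box centers.
    return (a.box2d[0] + a.box2d[2]) / 2 < (b.box2d[0] + b.box2d[2]) / 2


def is_closer_to_camera(a, b):
    # Depth: compare the minimum z over the eight 3D corners.
    return min(c[2] for c in a.corners3d) < min(c[2] for c in b.corners3d)


def is_bigger(a, b):
    # Size: compare 3D volumes (width x height x length).
    return a.dims[0] * a.dims[1] * a.dims[2] > b.dims[0] * b.dims[1] * b.dims[2]


def is_closer_to_anchor(anchor, a, b):
    # Relative distance: Euclidean distance between object centers.
    return dist(anchor.center3d, a.center3d) < dist(anchor.center3d, b.center3d)


table = make_obj((100, 200, 300, 400), (-0.5, 0.2, 2.0), (1.2, 0.8, 0.8))
printer = make_obj((350, 150, 420, 220), (0.6, 0.0, 3.5), (0.4, 0.3, 0.4))
chair = make_obj((220, 260, 320, 420), (0.0, 0.0, 2.5), (0.5, 0.9, 0.5))
print(is_left_of(table, printer), is_closer_to_camera(table, printer),
      is_bigger(table, printer), is_closer_to_anchor(chair, table, printer))
```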

#### Linguistic Variation\.

For each task, we define five question templates that paraphrase the same underlying question and two relation phrases (e.g., “left of” / “right of”). During training, templates and relations are sampled to encourage learning the spatial concept rather than pattern-matching specific phrasings. Table [6](https://arxiv.org/html/2606.11918#A4.T6) lists all templates.

Table 6: Question templates and relation phrases for each task. {IDX0} and {IDX1} are placeholders for object indices; {REL} is replaced by one of the relation phrases listed.

#### Answer Balance.

Because we sample relation phrases and object orders uniformly at random, the ground\-truth answers are balanced: approximately 50% True and 50% False for each task in both datasets\.

### D.3 Object and Pair Selection

Not all object pairs yield useful training examples\. Heavily occluded objects may be difficult to identify; objects with nearly identical depths create ambiguous comparisons; overlapping bounding boxes confuse left/right judgments\. We apply a series of filters to select high\-quality examples\.

#### Single\-Object Filters\.

Each object must satisfy validity criteria before being considered for pairing. Table [7](https://arxiv.org/html/2606.11918#A4.T7) summarizes the thresholds. We require sufficient visibility (using the Omni3D visibility score), reasonable 2D bounding box size (large enough to identify, small enough not to dominate the frame), and, for KITTI, that the 3D center lies in front of the camera plane.

Table 7: Single-object filter thresholds. Objects failing any criterion are excluded.

#### Pair and Group Filters.

When two objects overlap significantly, spatial relationships become ambiguous. We filter pairs by bounding box IoU (intersection over union) and coverage (fraction of one box contained in another). Stricter thresholds apply to same-class pairs, where visual confusion is more likely. Each task also requires sufficient separation along its relevant dimension. Table [8](https://arxiv.org/html/2606.11918#A4.T8) lists the thresholds.

Table 8: Pair/group filter thresholds. Pairs exceeding overlap limits or failing to meet minimum gaps are excluded.

| Filter | KITTI | SUN RGB-D |
| --- | --- | --- |
| Max IoU (any class) | 35% | 30% |
| Max IoU (same class) | 15% | 10% |
| Max coverage (any class) | 35% | 30% |
| Max coverage (same class) | 15% | 10% |
| Min depth gap (Depth task) | 0.5 m | 0.3 m |
| Min volume gap (Size task) | 1.0 m³ | 0.5 m³ |
| Min distance gap (Rel. Dist.) | 0.5 m | 0.3 m |

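The overlap part of these filters can be sketched as below. The limits are transcribed from Table 8; the function names are hypothetical, and "coverage" is interpreted as the larger of the two fractions (intersection divided by the smaller box's area), which is an assumption about the authors' definition.

```python
# Overlap limits from Table 8, as fractions. IoU and coverage share the same
# limit for a given dataset and class relationship.
OVERLAP_LIMITS = {
    "KITTI": {"any": 0.35, "same": 0.15},
    "SUN RGB-D": {"any": 0.30, "same": 0.10},
}


def area(box):
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def intersection(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def iou(a, b):
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def coverage(a, b):
    # Assumed definition: intersection over the smaller of the two box areas.
    smaller = min(area(a), area(b))
    return intersection(a, b) / smaller if smaller > 0 else 0.0


def passes_overlap_filter(a, b, same_class, dataset):
    limit = OVERLAP_LIMITS[dataset]["same" if same_class else "any"]
    return iou(a, b) <= limit and coverage(a, b) <= limit


box_a = (0, 0, 100, 100)
box_b = (80, 0, 180, 100)
print(iou(box_a, box_b), coverage(box_a, box_b))
print(passes_overlap_filter(box_a, box_b, same_class=False, dataset="KITTI"))
print(passes_overlap_filter(box_a, box_b, same_class=True, dataset="KITTI"))
```

In this toy pair the boxes overlap by a 20-pixel strip, which is acceptable for objects of different classes but exceeds the stricter same-class coverage limit.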
#### Selection Strategy\.

After filtering, multiple valid pairs typically remain per image\. We score each pair based on the metric gap and object size, then select according to a task\-specific strategy\.

### D.4 Transformations for Consistency Training

The consistency training approach described in [Section 4](https://arxiv.org/html/2606.11918#S4) requires paired prompts where the relationship between correct answers is known without access to ground-truth labels. We achieve this through transformations applied to both images and questions, each with a known effect on the correct answer.

#### Image Transformations\.

We apply three types of image transformations\. Horizontal flips mirror the image left\-to\-right, swapping the horizontal positions of all objects; we update the stored 2D and 3D bounding box coordinates accordingly\. Bounding\-box\-preserving crops randomly select a sub\-region of the image while ensuring all annotated objects remain fully visible; the crop scale ranges from 70% to 100% of the original image\. Color adjustments \(brightness, contrast, saturation\) modify the image appearance without affecting spatial relationships\.
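The two geometric image transformations can be sketched on 2D boxes alone. Everything here is illustrative: the function names are hypothetical, "crop scale" is assumed to mean a common scale factor on both side lengths, and the rejection-sampling loop with a no-crop fallback is one possible way to keep all boxes inside the crop, not necessarily the authors'.

```python
import random


def hflip_box(box, image_width):
    """Mirror a 2D box (x_min, y_min, x_max, y_max) left-to-right."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


def sample_box_preserving_crop(boxes, width, height, rng, min_scale=0.7, tries=50):
    """Sample a crop window (left, top, right, bottom) containing every box."""
    x_lo, y_lo = min(b[0] for b in boxes), min(b[1] for b in boxes)
    x_hi, y_hi = max(b[2] for b in boxes), max(b[3] for b in boxes)
    for _ in range(tries):
        scale = rng.uniform(min_scale, 1.0)
        w, h = scale * width, scale * height
        # Feasible range for the window's top-left corner.
        left_min, left_max = max(0.0, x_hi - w), min(x_lo, width - w)
        top_min, top_max = max(0.0, y_hi - h), min(y_lo, height - h)
        if left_min <= left_max and top_min <= top_max:
            left = rng.uniform(left_min, left_max)
            top = rng.uniform(top_min, top_max)
            return (left, top, left + w, top + h)
    return (0.0, 0.0, float(width), float(height))  # fall back to no crop


boxes = [(100, 100, 200, 200), (300, 150, 400, 300)]
flipped = [hflip_box(b, 640) for b in boxes]
crop = sample_box_preserving_crop(boxes, 640, 480, random.Random(0))
print(flipped, crop)
```

After cropping, stored box coordinates would be shifted by the window's top-left corner; that bookkeeping is omitted here.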

#### Text Transformations\.

We vary the question formulation through three mechanisms\. Template sampling selects among the five paraphrased question templates for each task\. Relation swapping replaces the queried relation with its opposite \(e\.g\., “closer to” becomes “further from”\)\. Object order swapping permutes which object is referred to as “object 1” versus “object 2” in the question\. Each mechanism changes the surface form of the question while preserving or predictably altering its meaning\.

#### Invariant Transformations\.

Some transformations preserve the correct answer\. Color adjustments do not affect spatial relationships, so the answer remains unchanged\. Bounding\-box\-preserving crops maintain all objects and their relative positions, leaving the answer unchanged\. These invariant transformations create prompt pairs where the consistency verifier expects matching answers\.

#### Equivariant Transformations\.

Other transformations predictably change the correct answer. A horizontal flip swaps left and right, negating the answer to orientation questions: if A was left of B before the flip, A is right of B after. Relation swapping negates the answer by asking the opposite question. Object order swapping negates the answer because these comparative relations are antisymmetric: “Is A closer than B?” and “Is B closer than A?” have opposite answers. These equivariant transformations create prompt pairs where the verifier expects opposite answers. Table [9](https://arxiv.org/html/2606.11918#A4.T9) summarizes the transformation properties.

Table 9: Transformation properties. Invariant transformations preserve the answer; equivariant transformations negate it.

#### Composition Rule.

When multiple transformations are applied simultaneously, their effects compose according to a simple rule: each equivariant transformation contributes one negation, and an even number of negations cancels out\. For example, applying both a horizontal flip and a relation swap results in two negations, so the final answer matches the original\. Applying a flip, a relation swap, and an object order swap results in three negations, so the final answer is negated\. This composition rule allows the consistency verifier to determine the expected answer relationship for any combination of transformations\.

#### Sampling Strategy\.

In practice, we use all transformations for all tasks\. Each transformation is applied independently with probability 0\.5, and the sampled transformations are composed\. This stochastic composition ensures diverse augmentation during training while maintaining the ability to compute the expected answer relationship via the composition rule\.
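The composition rule and the sampling strategy amount to a parity count over independently sampled transformations. In the sketch below the transformation names are hypothetical labels, and the per-task effect of each transformation is an assumption inferred from the descriptions above (the horizontal flip is treated as negating only orientation questions).

```python
import random

TRANSFORMS = [
    "horizontal_flip", "crop", "color_adjust",
    "template", "relation_swap", "object_order_swap",
]

# Assumed to negate the answer for every task.
ALWAYS_EQUIVARIANT = {"relation_swap", "object_order_swap"}


def negates(transform, task):
    if transform == "horizontal_flip":
        return task == "orientation"  # the flip only swaps left and right
    return transform in ALWAYS_EQUIVARIANT


def answers_should_differ(applied, task):
    # Each equivariant transformation contributes one negation;
    # an even number of negations cancels out.
    return sum(negates(t, task) for t in applied) % 2 == 1


def sample_transforms(rng, p=0.5):
    # Apply each transformation independently with probability p.
    return [t for t in TRANSFORMS if rng.random() < p]


rng = random.Random(0)
applied = sample_transforms(rng)
print(applied, answers_should_differ(applied, "orientation"))
```

The three checks below mirror the prompt pairs shown in Appendix D.5: a relation swap alone on a depth question, a flip plus relation swap on an orientation question, and a flip plus relation swap on a relative distance question.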

With the filtered object pairs, question templates, and transformation definitions in place, we have all the building blocks needed to generate paired prompts at training time\. During each training step, transformations are sampled and applied on\-the\-fly to create the original and augmented prompt pairs used by the consistency verifier\.

### D.5 Example Prompt Pairs

Figures [11](https://arxiv.org/html/2606.11918#A4.F11)–[14](https://arxiv.org/html/2606.11918#A4.F14) show example prompt pairs for each of the four spatial reasoning tasks. Each figure displays an original prompt (left) alongside an augmented prompt (right) created by applying various image and text transformations. The examples demonstrate different transformation combinations: some use only invariant transforms (where answers should match), while others include equivariant transforms that predictably change the answer. The prompts show the object descriptions and questions presented to the model; the colored markers (boxes or dots) indicate the referenced objects.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x11.png)\- object 1 = "table", highlighted by a red box\.
\- object 2 = "printer", highlighted by a blue box\.
Is object 1 closer to the camera than object 2?
Answer: True
![Refer to caption](https://arxiv.org/html/2606.11918v1/x12.png)\- object 1 = "table", highlighted by a red box\.
\- object 2 = "printer", highlighted by a blue box\.
Is object 1 further from the camera than object 2?
Answer: False

Figure 11:Depth task example\. The augmented prompt applies color jitter and relation swap \(“closer to”→\\to“further from”\), but no horizontal flip\. Since relation swap is a single equivariant transformation \(one negation\), theanswers should differ\.![Refer to caption](https://arxiv.org/html/2606.11918v1/x13.png)\- object 1 = "table", marked with a red dot\.
\- object 2 = "chair", marked with a blue dot\.
Is object 1 to the left of object 2?
Answer: False
![Refer to caption](https://arxiv.org/html/2606.11918v1/x14.png)
\- object 1 = "table", marked with a red dot\.
\- object 2 = "chair", marked with a blue dot\.
Is object 1 to the right of object 2?
Answer: False

Figure 12: Orientation task example\. The augmented prompt applies horizontal flip, color jitter, and relation swap \(“left of” → “right of”\)\. The flip negates the spatial relationship, and the relation swap negates the question: two equivariant transformations that cancel out, so the answers should match\.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x15.png)
\- object 1 = "chair", marked with a red dot\.
\- object 2 = "bookcase", marked with a blue dot\.
Is object 1 bigger than object 2?
Answer: False
![Refer to caption](https://arxiv.org/html/2606.11918v1/x16.png)
\- object 1 = "chair", marked with a red dot\.
\- object 2 = "bookcase", marked with a blue dot\.
Compared to object 2, is object 1 bigger than?
Answer: False

Figure 13: Size task example\. The augmented prompt applies an aggressive crop \(50–60% scale\), color jitter, and a different question template, but no relation swap\. All transformations are invariant \(zero negations\), so the answers should match\.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x17.png)
\- object 1 = "chair", highlighted by a red box\.
\- object 2 = "table", highlighted by a blue box\.
\- object 3 = "person", highlighted by a green box\.
Is object 2 closer to object 1 than object 3?
Answer: True
![Refer to caption](https://arxiv.org/html/2606.11918v1/x18.png)
\- object 1 = "chair", highlighted by a red box\.
\- object 2 = "table", highlighted by a blue box\.
\- object 3 = "person", highlighted by a green box\.
Is object 2 further from object 1 than object 3?
Answer: False

Figure 14: Relative distance task example \(triplet\)\. The augmented prompt applies horizontal flip, color jitter, and relation swap \(“closer to” → “further from”\)\. The flip does not affect inter\-object distances, so only the relation swap contributes a negation, and the answers should differ\.
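
The four examples above show that whether a transformation negates the answer depends on the task: a horizontal flip negates a left/right question but leaves inter\-object distances unchanged\. The sketch below encodes this as a per\-task lookup and checks the answer pairs from Figures 11–14\. The table, function names, and transformation names are hypothetical\. The orientation, relative\-distance, and depth relation\-swap entries follow the captions; treating relation swap as negating for size, and horizontal flip as invariant for depth and size, are assumptions\.

```python
# Assumed lookup: transformations that negate a binary answer, per task.
NEGATING = {
    "depth": {"relation_swap"},
    "orientation": {"horizontal_flip", "relation_swap"},
    "size": {"relation_swap"},  # assumption: not shown in the examples
    "relative_distance": {"relation_swap"},
}


def expected_match(task, applied):
    """True if the two answers should be equal (even number of negations)."""
    negations = sum(1 for name in applied if name in NEGATING[task])
    return negations % 2 == 0


def is_consistent(task, applied, answer_orig, answer_aug):
    """Pairwise check: observed agreement must equal expected agreement."""
    return (answer_orig == answer_aug) == expected_match(task, applied)


# (task, applied transformations, original answer, augmented answer)
EXAMPLES = [
    ("depth", ["color_jitter", "relation_swap"], True, False),
    ("orientation", ["horizontal_flip", "color_jitter", "relation_swap"], False, False),
    ("size", ["crop", "color_jitter", "template_resample"], False, False),
    ("relative_distance", ["horizontal_flip", "color_jitter", "relation_swap"], True, False),
]

for task, applied, a, b in EXAMPLES:
    print(task, expected_match(task, applied), is_consistent(task, applied, a, b))
```

Note that the check never consults a ground\-truth label: a pair of answers can be consistent and still both be wrong, which is why this is only a consistency signal\.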

### D\.6 KITTI Example Prompt Pairs

Figures [15](https://arxiv.org/html/2606.11918#A4.F15)–[18](https://arxiv.org/html/2606.11918#A4.F18) show example prompt pairs from the KITTI dataset \(outdoor driving scenes\)\. KITTI images feature sparser layouts with objects at greater depths \(5–100 m\), predominantly containing vehicles \(cars, vans, trucks\) and pedestrians\. The same transformation strategies apply as in SUN RGB\-D\.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x19.png)
\- object 1 = "car", highlighted by a red box\.
\- object 2 = "car", highlighted by a blue box\.
Is object 1 closer to the camera than object 2?
Answer: True
![Refer to caption](https://arxiv.org/html/2606.11918v1/x20.png)
\- object 1 = "car", highlighted by a red box\.
\- object 2 = "car", highlighted by a blue box\.
Is object 1 further from the camera than object 2?
Answer: False

Figure 15: KITTI depth task example\. The augmented prompt applies color jitter and relation swap \(“closer to” → “further from”\), but no horizontal flip\. Since relation swap is a single equivariant transformation \(one negation\), the answers should differ\.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x21.png)
\- object 1 = "car", marked with a red dot\.
\- object 2 = "car", marked with a blue dot\.
Is object 1 to the left of object 2?
Answer: False
![Refer to caption](https://arxiv.org/html/2606.11918v1/x22.png)
\- object 1 = "car", marked with a red dot\.
\- object 2 = "car", marked with a blue dot\.
Is object 1 to the right of object 2?
Answer: False

Figure 16: KITTI orientation task example\. The augmented prompt applies horizontal flip, color jitter, and relation swap \(“left of” → “right of”\)\. The flip negates the spatial relationship, and the relation swap negates the question: two equivariant transformations that cancel out, so the answers should match\.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x23.png)
\- object 1 = "car", marked with a red dot\.
\- object 2 = "cyclist", marked with a blue dot\.
Is object 1 bigger than object 2?
Answer: False
![Refer to caption](https://arxiv.org/html/2606.11918v1/x24.png)
\- object 1 = "car", marked with a red dot\.
\- object 2 = "cyclist", marked with a blue dot\.
Compared to object 2, is object 1 larger?
Answer: False

Figure 17: KITTI size task example\. The augmented prompt applies a bounding\-box\-aware crop, color jitter, and a different question template, but no relation swap\. All transformations are invariant \(zero negations\), so the answers should match\.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x25.png)
\- object 1 = "car", highlighted by a red box\.
\- object 2 = "car", highlighted by a blue box\.
\- object 3 = "car", highlighted by a green box\.
Is object 2 closer to object 1 than object 3?
Answer: True
![Refer to caption](https://arxiv.org/html/2606.11918v1/x26.png)
\- object 1 = "car", highlighted by a red box\.
\- object 2 = "car", highlighted by a blue box\.
\- object 3 = "car", highlighted by a green box\.
Is object 2 further from object 1 than object 3?
Answer: False

Figure 18: KITTI relative distance task example \(triplet\)\. The augmented prompt applies horizontal flip, color jitter, and relation swap \(“closer to” → “further from”\)\. The flip does not affect inter\-object distances, so only the relation swap contributes a negation, and the answers should differ\.

### D\.7 Numeric Task Example Prompt Pairs

Figures [19](https://arxiv.org/html/2606.11918#A4.F19)–[20](https://arxiv.org/html/2606.11918#A4.F20) show example prompt pairs for the two numeric tasks introduced in [Section 5](https://arxiv.org/html/2606.11918#S5)\. As discussed there, the answer \(an integer count or a metric distance\) is invariant under every transformation we apply, so the consistency verifier expects the original and augmented predictions to be *equal*; there is no equivariant case to consider\.
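
For numeric answers, the pairwise check reduces to an equality test\. The following is a minimal sketch of such a check, not the paper's verifier: the function name `numeric_consistent` is hypothetical, and the optional relative tolerance for metric distances is an assumption \(the text above only states that the two predictions should be equal\)\.

```python
import math


def numeric_consistent(pred_orig, pred_aug, rel_tol=0.0):
    """True if the two numeric predictions agree.

    With rel_tol=0.0 this is exact equality (suitable for integer counts).
    A nonzero rel_tol is an assumed relaxation for metric distances.
    """
    return math.isclose(pred_orig, pred_aug, rel_tol=rel_tol, abs_tol=0.0)


print(numeric_consistent(3, 3))                     # counting pair
print(numeric_consistent(1.4, 1.4))                 # distance pair, exact
print(numeric_consistent(1.4, 1.45, rel_tol=0.05))  # distance pair, relaxed
```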

![Refer to caption](https://arxiv.org/html/2606.11918v1/x27.png)
How many chairs are visible in the image?
Answer: 3
![Refer to caption](https://arxiv.org/html/2606.11918v1/x28.png)
In the image, how many chairs can you see?
Answer: 3

Figure 19: Counting task example\. The augmented prompt applies an object\-preserving crop, color jitter, and template resampling\. All transformations are invariant for counting \(the number of objects of a given class is unchanged\), so the answers should match\.

![Refer to caption](https://arxiv.org/html/2606.11918v1/x29.png)
\- object 1 = "table", marked with a red dot\.
\- object 2 = "chair", marked with a blue dot\.
What is the distance between object 1 and object 2 in meters?
Answer: 1\.4
![Refer to caption](https://arxiv.org/html/2606.11918v1/x30.png)
\- object 1 = "table", marked with a red dot\.
\- object 2 = "chair", marked with a blue dot\.
In meters, how far apart are object 1 and object 2?
Answer: 1\.4

Figure 20: Absolute distance task example\. The augmented prompt applies horizontal flip, color jitter, and template resampling\. Mirroring and visual jitter leave the 3D distance between two objects unchanged, and the question paraphrasing keeps its meaning, so the answers should match\.