Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

Hugging Face Daily Papers

Summary

CorVer is a lightweight, corpus-grounded reward mechanism that uses Wikipedia co-occurrence statistics to provide efficient sentence-level feedback for reinforcement learning in factual question answering, outperforming neural verifiers while training 4.8 to 8.4x faster.

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, where accurate reward signals are especially important. We propose CorVer (Corpus Verify), a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 (model, benchmark) cells spanning six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer improves over the raw baseline for every cell, with an average TriviaQA gain of +4.1 pp. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8 to 8.4x faster.

Cached at: 05/29/26, 07:00 AM

Source: https://huggingface.co/papers/2605.29648

Abstract

CorVer, a corpus-grounded reward mechanism, enhances factual accuracy in question answering by providing efficient sentence-level feedback through Wikipedia co-occurrence statistics, outperforming neural verifiers while reducing training time.

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, where accurate reward signals are especially important. We propose CorVer (Corpus Verify), a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 (model, benchmark) cells spanning six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer improves over the raw baseline for every cell, with an average TriviaQA gain of +4.1 pp. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8 to 8.4x faster.
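
The abstract does not spell out CorVer's scoring rule or its alignment step, so the following is only a minimal sketch of the general idea: score each sentence against corpus co-occurrence counts, then broadcast that score onto the tokens of the sentence as an adjustment to a response-level advantage. Everything here is an assumption made for illustration: the toy `COOCCURRENCE` table stands in for Wikipedia statistics, `extract_entities` is a substring stub in place of the 0.5B extractor, and the names `sentence_score`, `token_advantages`, `min_count`, and `weight` are hypothetical rather than taken from the paper.

```python
import re
from itertools import combinations

# Toy stand-in for Wikipedia co-occurrence counts (hypothetical numbers).
COOCCURRENCE = {
    frozenset({"Marie Curie", "Warsaw"}): 120,
    frozenset({"Marie Curie", "Nobel Prize"}): 950,
}
KNOWN_ENTITIES = ["Marie Curie", "Warsaw", "Vienna", "Nobel Prize"]


def extract_entities(sentence):
    """Stub extractor (assumption): substring match against a fixed list."""
    return [e for e in KNOWN_ENTITIES if e in sentence]


def sentence_score(sentence, min_count=1):
    """Return +1 if every entity pair co-occurs in the corpus table,
    -1 if some pair does not, and 0 if fewer than two entities are found."""
    entities = extract_entities(sentence)
    if len(entities) < 2:
        return 0.0
    supported = all(
        COOCCURRENCE.get(frozenset(pair), 0) >= min_count
        for pair in combinations(entities, 2)
    )
    return 1.0 if supported else -1.0


def token_advantages(response, response_advantage, weight=0.5):
    """Assign each whitespace token the response-level advantage plus a
    weighted bonus or penalty from the sentence that contains it."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    result = []
    for sentence in sentences:
        adv = response_advantage + weight * sentence_score(sentence)
        result.extend((tok, adv) for tok in sentence.split())
    return result


response = "Marie Curie was born in Warsaw. Marie Curie was born in Vienna."
for token, adv in token_advantages(response, response_advantage=0.2):
    print(f"{token:10s} {adv:+.2f}")
```

In this toy run, tokens of the first sentence receive a higher advantage than tokens of the second, even though both sentences share one response-level reward. A real setup would align sentences to model tokenizer offsets rather than whitespace tokens.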

Similar Articles

AgentV-RL: Scaling Reward Modeling with Agentic Verifier

arXiv cs.CL

AgentV-RL introduces an Agentic Verifier framework that enhances reward modeling through bidirectional verification with forward and backward agents augmented with tools, achieving 25.2% improvement over state-of-the-art ORMs. The approach addresses error propagation and grounding issues in verifiers for complex reasoning tasks through multi-turn deliberative processes combined with reinforcement learning.

VeriGate: Verifier-Gated Step-Level Supervision for GRPO

arXiv cs.LG

VeriGate extends GRPO with verifier-gated step-level supervision, providing fine-grained credit assignment when verifier rewards are degenerate. It achieves substantial accuracy improvements on reasoning benchmarks for 1.5B and 7B models.

@neural_avb: https://x.com/neural_avb/status/2063907440509571354

X AI KOLs Timeline

Explores a common failure mode in recursive language models (RLMs) where free-text subagent responses cause issues, and presents a solution using structured outputs to improve reliability, illustrated with a long-context question-answering example from NarrativeQA.

OCC-RAG: Optimal Cognitive Core for Faithful Question Answering

arXiv cs.CL

OCC-RAG introduces a family of compact small language models optimized for faithful question answering, using a novel pipeline to synthesize multi-context multi-hop QA data. The models demonstrate competitive performance against larger models on reasoning and faithfulness benchmarks.