Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering
Summary
CorVer is a lightweight, corpus-grounded reward mechanism that uses Wikipedia co-occurrence statistics to provide efficient sentence-level feedback for reinforcement learning in factual question answering, outperforming neural verifiers while training 4.8 to 8.4x faster.
Cached at: 05/29/26, 07:00 AM
Source: https://huggingface.co/papers/2605.29648
Abstract
CorVer, a corpus-grounded reward mechanism, enhances factual accuracy in question answering by providing efficient sentence-level feedback through Wikipedia co-occurrence statistics, outperforming neural verifiers while reducing training time.
Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, where accurate reward signals are especially important. We propose CorVer (Corpus Verify), a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 (model, benchmark) cells spanning six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer improves over the raw baseline for every cell, with an average TriviaQA gain of +4.1 pp. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8 to 8.4x faster.
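The abstract does not spell out CorVer's scoring formula or its alignment rule, so the following is only an illustrative sketch of the general idea: score each sentence with a corpus co-occurrence lookup, then give every token the credit of the sentence that contains it. The function names (`sentence_reward`, `token_advantages`), the `min_count` threshold, the `beta` weight, the +1/-1 reward values, and the toy `COOC` table with made-up counts are all assumptions for illustration, not the paper's method. Entity extraction (the 0.5B extractor in the paper) is assumed to have already produced the entity lists.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Span = Tuple[int, int]  # [start, end) character offsets in the response

# Toy stand-in for a Wikipedia co-occurrence index (counts are made up).
COOC: Dict[FrozenSet[str], int] = {
    frozenset({"Marie Curie", "Warsaw"}): 120,
    frozenset({"Marie Curie", "Lima"}): 0,
}


def sentence_reward(entities: List[str],
                    cooc: Dict[FrozenSet[str], int],
                    min_count: int = 5) -> float:
    """Hypothetical rule: +1 if every entity pair co-occurs at least
    `min_count` times in the corpus, -1 otherwise, 0 if nothing to check."""
    unique = sorted(set(entities))
    if len(unique) < 2:
        return 0.0  # fewer than two distinct entities: no pair to look up
    supported = all(cooc.get(frozenset(pair), 0) >= min_count
                    for pair in combinations(unique, 2))
    return 1.0 if supported else -1.0


def token_advantages(token_spans: List[Span],
                     sentence_spans: List[Span],
                     sentence_rewards: List[float],
                     base_advantage: float = 0.0,
                     beta: float = 1.0) -> List[float]:
    """Assumed alignment: a token inherits the reward of the sentence whose
    span contains the token's start offset; the result is added, scaled by
    `beta`, to a response-level advantage."""
    advantages = []
    for tok_start, _ in token_spans:
        credit = 0.0  # tokens outside every sentence get no extra credit
        for (s_start, s_end), reward in zip(sentence_spans, sentence_rewards):
            if s_start <= tok_start < s_end:
                credit = reward
                break
        advantages.append(base_advantage + beta * credit)
    return advantages


# Two sentences: the first is corpus-supported, the second is not.
rewards = [
    sentence_reward(["Marie Curie", "Warsaw"], COOC),
    sentence_reward(["Marie Curie", "Lima"], COOC),
]
adv = token_advantages(
    token_spans=[(0, 5), (6, 25), (26, 31), (32, 50)],
    sentence_spans=[(0, 25), (26, 50)],
    sentence_rewards=rewards,
    base_advantage=0.5,
    beta=0.25,
)
print(rewards, adv)
```

In this sketch, tokens in the supported sentence end up with a higher advantage than tokens in the unsupported one, even though both sentences share the same response-level advantage. That is the kind of within-response distinction the abstract says response-level rewards cannot make.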
Similar Articles
Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO
This paper proposes a Variance-Aware Reward Framework using GRPO to improve LLM performance on heart-focused medical question answering, achieving significant accuracy and F1 gains on a HealthBench subset.
AgentV-RL: Scaling Reward Modeling with Agentic Verifier
AgentV-RL introduces an Agentic Verifier framework that enhances reward modeling through bidirectional verification with forward and backward agents augmented with tools, achieving 25.2% improvement over state-of-the-art ORMs. The approach addresses error propagation and grounding issues in verifiers for complex reasoning tasks through multi-turn deliberative processes combined with reinforcement learning.
VeriGate: Verifier-Gated Step-Level Supervision for GRPO
VeriGate extends GRPO with verifier-gated step-level supervision, providing fine-grained credit assignment when verifier rewards are degenerate. It achieves substantial accuracy improvements on reasoning benchmarks for 1.5B and 7B models.
@neural_avb: https://x.com/neural_avb/status/2063907440509571354
Explores a common failure mode in recursive language models (RLMs) where free-text subagent responses cause issues, and presents a solution using structured outputs to improve reliability, illustrated with a long-context question-answering example from NarrativeQA.
OCC-RAG: Optimal Cognitive Core for Faithful Question Answering
OCC-RAG introduces a family of compact small language models optimized for faithful question answering, using a novel pipeline to synthesize multi-context multi-hop QA data. The models demonstrate competitive performance against larger models on reasoning and faithfulness benchmarks.