Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Environments
Summary
The paper introduces CXR-MAX, a large-scale benchmark for evaluating reasoning alignment in non-stationary environments. It extends the MIMIC-CXR chest X-ray dataset with reasoning trajectories aggregated from seven publicly available MLLMs.
Cached at: 05/08/26, 07:46 AM
Source: https://huggingface.co/papers/2510.04142
To evaluate reasoning alignment in non-stationary environments, a dataset exhibiting high-variance inter-model drift is essential. Existing benchmarks typically rely on single-source annotations or static consensus, failing to capture the dynamic conflicts inherent in multi-stream reasoning.
Addressing this gap, we introduce CXR-MAX (Multi-source Alignment for X-rays), a large-scale benchmark designed to facilitate the study of autonomous preference optimization. CXR-MAX extends the MIMIC-CXR dataset by aggregating reasoning trajectories from seven distinct, publicly available MLLMs. CXR-MAX provides 170,982 distillation instances of reasoning trajectories covering 14 thoracic pathologies.
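To make "inter-model drift" concrete, the sketch below scores how much a set of models disagree on a single finding. It is an illustration only, not the paper's metric or the CXR-MAX data format: the record layout, the field names `study_id` and `labels`, the model names, and the 0.3 threshold are all hypothetical.

```python
from collections import Counter


def disagreement(labels):
    """Fraction of models whose label differs from the most common label.

    `labels` maps a model name to its predicted label for one finding.
    0.0 means every model agrees; larger values mean more conflict.
    This is a simple majority-based score chosen for illustration.
    """
    if not labels:
        raise ValueError("need at least one model label")
    counts = Counter(labels.values())
    majority_count = counts.most_common(1)[0][1]
    return 1.0 - majority_count / len(labels)


def high_drift_instances(records, threshold=0.3):
    """Return the ids of records whose disagreement meets the threshold."""
    return [r["study_id"] for r in records if disagreement(r["labels"]) >= threshold]


# Hypothetical records: seven models labelling one pathology per study.
records = [
    {"study_id": "s1", "labels": {f"model_{i}": "positive" for i in range(7)}},
    {
        "study_id": "s2",
        "labels": {
            "model_0": "positive", "model_1": "positive", "model_2": "positive",
            "model_3": "positive", "model_4": "negative", "model_5": "negative",
            "model_6": "negative",
        },
    },
]

print(high_drift_instances(records))
```

With seven models, a 4-to-3 split scores 1 - 4/7 (about 0.43), so only the second record is flagged. A benchmark built from a single annotation source or a static consensus label would hide this split entirely.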
Similar Articles
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
This paper introduces ReasonMatch-Bench, a benchmark for wide-baseline matching in multimodal LLMs, and proposes Dynamic Correspondence Reinforcement Learning (DCRL) to improve spatial reasoning. Experiments show significant gains on the benchmark while maintaining general performance.
Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning
This paper introduces satisfiable drift, a failure mode in which multi-turn reasoning systems silently violate prior commitments while maintaining internal logical consistency, and reports that it dominates outright contradictions. The authors present DRIFT-Bench, a benchmark of 816 problems, and find that after repair, 98-100% of residual errors are drift errors.
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
This paper introduces CASPO, a framework for aligning token-level confidence with step-wise logical correctness in large reasoning models using iterative Direct Preference Optimization. It also proposes Confidence-aware Thought (CaT) for dynamically pruning uncertain reasoning branches during inference to improve reliability and efficiency.
Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
This paper identifies a blind spot in long-context LLM reasoning benchmarks: they fail to control task position within the context, allowing positional failures to go undetected. The authors propose Context Rot Evaluation (CRE) to systematically vary task position, filler content, and context length, revealing severe accuracy drops for some models when reasoning tasks are placed in the middle of long contexts.
GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models
This paper introduces GENSTRAT, a benchmark that uses procedurally generated strategic environments to evaluate LLMs' strategic reasoning across multiple axes, addressing limitations of fixed game suites.