Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Environments

Hugging Face Daily Papers Papers

Summary

The paper introduces CXR-MAX, a large-scale benchmark for evaluating reasoning alignment in non-stationary environments using X-ray data from multiple MLLMs.

This paper identifies a critical yet underexplored challenge in reasoning alignment from multiple multi-modal large language models (MLLMs): In non-stationary environments, the diverse reasoning distributions of source models often evolve unpredictably, transmitting systematic biases and drift to the target model. To address this, we formulate multi-source reasoning alignment as a constraint satisfaction problem under concept drift theory. We propose Autonomous Preference Optimization (APO), a novel framework that treats inter-model divergences not as noise, but as dynamic negative constraints. APO operates via a two-stage protocol: first, supervised bootstrapping projects the target model into the capability union of source models; second, constraint-aware optimization synthesizes a consistent consensus manifold by explicitly suppressing drifting trajectories via a multi-negative Plackett-Luce objective. Extensive experiments on chest X-ray interpretation demonstrate that our 7B model achieves superior robustness, outperforming even proprietary source models in average accuracy. Furthermore, we release CXR-MAX, a large-scale benchmark comprising 170,982 reasoning trajectories from seven large-scale MLLMs to facilitate research on reasoning alignment under drift. Code and data are available at: https://github.com/XiaoyuYoung/APO.
Original Article
View Cached Full Text

Cached at: 05/08/26, 07:46 AM

Paper page - Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Environments

Source: https://huggingface.co/papers/2510.04142 To evaluate reasoning alignment in non-stationary environments, a dataset exhibiting high-variance inter-model drift is essential. Existing benchmarks typically rely on single-source annotations or static consensus, failing to capture the dynamic conflicts inherent in multi-stream reasoning.

Addressing this gap, we introduce CXR-MAX (Multi-source Alignment for X-rays), a large-scale benchmark designed to facilitate the study of autonomous preference optimization. CXR-MAX extends the MIMIC-CXR dataset by aggregating reasoning trajectories from seven distinct, publicly available MLLMs. CXR-MAX provides 170,982 distillation instances of reasoning trajectories covering 14 thoracic pathologies.

Similar Articles

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Hugging Face Daily Papers

This paper introduces ReasonMatch-Bench, a benchmark for wide-baseline matching in multimodal LLMs, and proposes Dynamic Correspondence Reinforcement Learning (DCRL) to improve spatial reasoning. Experiments show significant gains on the benchmark while maintaining general performance.

Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning

arXiv cs.AI

This paper introduces satisfiable drift, a failure mode where multi-turn reasoning systems silently violate prior commitments while maintaining internal logical consistency, dominating contradictions. The authors present DRIFT-Bench, a benchmark of 816 problems, and find that after repair, 98-100% of residual errors are drift errors.

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

arXiv cs.AI

This paper introduces CASPO, a framework for aligning token-level confidence with step-wise logical correctness in large reasoning models using iterative Direct Preference Optimization. It also proposes Confidence-aware Thought (CaT) for dynamically pruning uncertain reasoning branches during inference to improve reliability and efficiency.

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

arXiv cs.CL

This paper identifies a blind spot in long-context LLM reasoning benchmarks: they fail to control task position within the context, allowing positional failures to go undetected. The authors propose Context Rot Evaluation (CRE) to systematically vary task position, filler content, and context length, revealing severe accuracy drops for some models when reasoning tasks are placed in the middle of long contexts.