Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

Hugging Face Daily Papers

Summary

This paper introduces a distribution-aware reinforcement learning framework that enhances MLLM performance in long-tailed numerical regression tasks using batch-level comparison-based supervision.

Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level comparison-based supervision via the Concordance Correlation Coefficient-based reward to align predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.
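The CCC reward at the core of the method can be computed directly from a batch of predicted and ground-truth values. Below is a minimal sketch in Python, assuming numeric predictions have already been parsed from the model's text outputs; the function name and the epsilon guard are illustrative, not taken from the paper's code.

import numpy as np

def ccc_reward(preds: np.ndarray, targets: np.ndarray) -> float:
    """Concordance Correlation Coefficient between batch predictions and
    ground-truth values. Ranges over [-1, 1], with 1 = perfect agreement.

    Unlike point-wise losses such as per-sample MSE, CCC jointly penalizes
    mismatch in correlation, scale, and mean across the batch.
    """
    mu_p, mu_t = preds.mean(), targets.mean()
    var_p, var_t = preds.var(), targets.var()
    cov = ((preds - mu_p) * (targets - mu_t)).mean()
    return 2.0 * cov / (var_p + var_t + (mu_p - mu_t) ** 2 + 1e-8)

Because the denominator jointly penalizes scale and mean mismatch, a policy that collapses every prediction to the head-region mean earns a reward near zero (the covariance term vanishes) even when its point-wise error looks acceptable; this is the property that counteracts regression-to-the-mean behavior.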
Original Article

Paper page - Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

Source: https://huggingface.co/papers/2605.01402 · Published on May 11

Submitted by DUYao (https://huggingface.co/ChanganYao) on May 12

Abstract

A distribution-aware reinforcement learning framework improves multimodal large language models’ numerical regression performance on long-tailed distributions through batch-level comparison-based supervision.

Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level comparison-based supervision via the Concordance Correlation Coefficient-based reward to align predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.
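For orientation, here is how a batch-level reward of this kind would plug into GRPO's group-relative advantage computation. This is a sketch of the standard GRPO formulation (rewards standardized within a group of rollouts sampled for the same input), not the authors' released implementation; rollout_preds and the reuse of ccc_reward from the sketch above are assumptions.

import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Standard GRPO advantage: standardize each rollout's reward against
    # its peers in the same group, so supervision is comparative rather
    # than an absolute point-wise target.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical usage: score G rollouts with the batch-level CCC reward,
# then standardize within the group.
# rewards = np.array([ccc_reward(p, targets) for p in rollout_preds])
# advantages = group_relative_advantages(rewards)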


Get this paper in your agent:

hf papers read 2605.01402

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

arXiv cs.CL

This paper investigates whether reinforcement learning can improve the direct recall of parametric knowledge in LLMs beyond reasoning tasks. It demonstrates that RL with binary rewards yields significant gains in factual QA benchmarks by redistributing probability mass to unlock latent knowledge rather than acquiring new facts.

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

Hugging Face Daily Papers

The paper introduces BalCapRL, a balanced reinforcement learning framework for multimodal large language models that jointly optimizes correctness, coverage, and linguistic quality in image captioning. It demonstrates improved performance over existing methods by addressing trade-offs between utility and fluency through reward decoupling and length-conditional masking.