Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

Hugging Face Daily Papers

Summary

This paper presents OmniClean, a visually debiased evaluation benchmark for omni-modal language models, and proposes OmniBoost, a three-stage post-training recipe that enables a 3B model to match the performance of a 30B model on the cleaned benchmark.

Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision. Project page: https://cheliu-computation.github.io/omni/
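As a quick check on the scale of the filtering, the two counts quoted above can be turned into a removed-query count and a retained share. This is only arithmetic on the abstract's numbers; the variable names are illustrative and nothing here comes from the authors' code.

```python
# Counts quoted in the abstract (nothing else is assumed).
audited = 16_968   # queries audited across the nine benchmarks
retained = 8_551   # queries kept in OmniClean

# Queries that did not survive into the cleaned view.
removed = audited - retained

# Fraction of audited queries that remain in OmniClean.
retained_share = retained / audited

print(f"removed queries: {removed}")
print(f"retained share:  {retained_share:.1%}")
```

Roughly half of the audited queries are dropped, which is why scores on the original benchmarks and on OmniClean are not directly comparable.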

Cached at: 05/15/26, 12:25 PM

Source: https://huggingface.co/papers/2605.12034

Published on May 13 · Submitted by liu (https://huggingface.co/che111) on May 15

Abstract

Research demonstrates that current omni-modal benchmarks may inflate performance through visual shortcuts, and shows that post-training techniques can improve model performance on a cleaned benchmark with reduced visual leakage.

Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision. Project page: https://cheliu-computation.github.io/omni/
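The audit step described in the abstract (probe with visual input only, drop queries the probe already solves, keep a subset whole when filtering is undefined or unstable) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the `Query` record, the `visual_only_pred` field, the `min_keep` threshold, and the rule that "unstable" means "too few queries survive" are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Query:
    """Hypothetical record for one benchmark query."""
    benchmark: str
    query_id: str
    answer: str                      # gold answer
    visual_only_pred: Optional[str]  # probe answer from visual input only; None if no probe could be run


def build_clean_view(queries, min_keep=2):
    """Drop visually solvable queries, benchmark by benchmark.

    Assumed fallback rules (not taken from the paper):
    - filtering is "undefined" if any query in the subset has no probe result;
    - filtering is "unstable" if fewer than `min_keep` queries would survive.
    In both cases the full subset is retained.
    """
    by_benchmark = defaultdict(list)
    for q in queries:
        by_benchmark[q.benchmark].append(q)

    clean = []
    for items in by_benchmark.values():
        if any(q.visual_only_pred is None for q in items):
            clean.extend(items)  # filtering undefined: keep the whole subset
            continue
        # A query counts as visually solvable if the visual-only probe is correct.
        kept = [q for q in items if q.visual_only_pred != q.answer]
        clean.extend(kept if len(kept) >= min_keep else items)
    return clean


# Toy data, invented purely for illustration.
toy_queries = [
    Query("A", "a1", "cat", "cat"),   # visually solvable, dropped
    Query("A", "a2", "dog", "cat"),
    Query("A", "a3", "bird", "fish"),
    Query("B", "b1", "yes", None),    # no probe available, subset kept whole
    Query("B", "b2", "no", None),
    Query("C", "c1", "red", "red"),   # only one query would survive, subset kept whole
    Query("C", "c2", "blue", "red"),
]

clean_ids = [q.query_id for q in build_clean_view(toy_queries)]
print(clean_ids)
```

Keeping a subset whole when filtering cannot be applied cleanly trades some residual visual leakage for comparisons that rest on a stable number of queries, which matches the abstract's stated motivation even though the exact criteria there may differ.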


Get this paper in your agent:

hf papers read 2605.12034

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper: 1

#### che111/OmniClean • Updated about 3 hours ago • 8.55k • 39 • 1

Similar Articles

OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models

arXiv cs.CL

This paper introduces OmniThoughtVis, a scalable pipeline for distilling multimodal reasoning capabilities from large teacher models to smaller, deployment-oriented MLLMs. The method uses curated chain-of-thought data to significantly improve reasoning performance on benchmarks like MathVerse and MMMU-Pro for models ranging from 2B to 8B parameters.

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

Hugging Face Daily Papers

This paper introduces Omni-DuplexEval, a benchmark and automatic evaluation framework for real-time duplex interaction in multimodal large language models, assessing continuous response generation and proactive event detection in streaming scenarios.