# OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models
Source: [https://arxiv.org/html/2605.11629](https://arxiv.org/html/2605.11629)
Yuanhao Yue, Chengyu Wang (corresponding author), Yuanjie Lyu, Lei Shen, Jun Huang
Alibaba Group, Hangzhou, China
{yueyuanhao.yyh, chengyu.wcy, lyuyuanjie.lyj, yuzhou.sl, huangjun.hj}@alibaba-inc.com
###### Abstract
Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource constraints. In practice, smaller MLLMs are preferred for online serving, yet their reasoning performance is bottlenecked by the lack of large-scale, high-quality multimodal CoT supervision. In this paper, we present OmniThoughtVis, a scalable data curation and distillation pipeline for transferring multimodal reasoning capabilities from high-capacity teacher models to smaller, deployment-oriented MLLMs. Starting from a diverse open-source seed pool, our pipeline generates structured CoT traces and performs joint annotation of reasoning difficulty, answer quality, and semantic task tags. To maintain data quality at scale, we combine rule-based filtering, difficulty-aware selection, and tag-based diversity sampling, resulting in a curated corpus of 1.8M samples that supports controllable subset construction for downstream training. We use OmniThoughtVis to distill Qwen3-VL models from 2B to 8B parameters and evaluate them on nine multimodal reasoning benchmarks. The resulting distilled models show consistent gains across model scales, including improvements of up to +16.8 points on MathVerse and +5.6 points on MMMU-Pro for the 4B model. Notably, the distilled 4B model matches or surpasses the undistilled 8B baseline on several tasks, highlighting the practical value of scalable reasoning distillation for deployment-oriented MLLMs.
## 1 Introduction
Recent multimodal large language models (MLLMs) have demonstrated strong chain-of-thought (CoT) reasoning ability on a wide range of vision-language tasks (Bai et al., [2025b](https://arxiv.org/html/2605.11629#bib.bib2)). However, directly deploying such models in real-world applications remains challenging: large models typically incur substantial latency, memory, and serving cost, while smaller models that are more suitable for online inference often underperform on reasoning-intensive tasks (Hinton et al., [2015](https://arxiv.org/html/2605.11629#bib.bib10); Yang et al., [2025](https://arxiv.org/html/2605.11629#bib.bib26); Wang et al., [2025](https://arxiv.org/html/2605.11629#bib.bib22)). This creates a practical gap between the reasoning quality of frontier MLLMs and the efficiency requirements of deployment-oriented systems.
A central bottleneck is the lack of large-scale, high-quality multimodal reasoning supervision. In the text domain, several large-scale CoT corpora have accelerated progress in natural language reasoning (Chung et al., [2024](https://arxiv.org/html/2605.11629#bib.bib6); Toshniwal et al., [2024](https://arxiv.org/html/2605.11629#bib.bib21); Yu et al., [2024](https://arxiv.org/html/2605.11629#bib.bib27); Cai et al., [2025](https://arxiv.org/html/2605.11629#bib.bib3)). In contrast, open multimodal datasets are still dominated by instruction tuning or final-answer supervision, with limited support for structured, stepwise reasoning transfer (Liu et al., [2023](https://arxiv.org/html/2605.11629#bib.bib15); Chen et al., [2024a](https://arxiv.org/html/2605.11629#bib.bib4); Li et al., [2025](https://arxiv.org/html/2605.11629#bib.bib14)). As a result, smaller MLLMs fine-tuned on such data often exhibit shallow reasoning behavior and struggle on compositional or multi-step tasks. Moreover, naively generating multimodal CoT data at scale is insufficient in practice: synthetic traces can be noisy, redundant, and unevenly distributed across task types, reducing their usefulness for reliable distillation.
Recent efforts such as BEE (Zhang et al., [2026](https://arxiv.org/html/2605.11629#bib.bib33)) and OpenMMReasoner (Zhang et al., [2025a](https://arxiv.org/html/2605.11629#bib.bib30)) have advanced open multimodal data curation and reproducible training pipelines. Our work is complementary to these efforts but focuses on a different practical objective: transferring reasoning ability into smaller, deployment-oriented MLLMs through a controllable large-scale distillation pipeline. In particular, rather than treating the source corpus as a ready-to-train dataset, we treat it as a seed pool and convert it into a distillation-oriented reasoning corpus by generating structured CoT traces, attaching joint annotations for reasoning difficulty, answer quality, and semantic task tags, and applying filtering and diversity-aware subset selection for downstream training.
To this end, we present OmniThoughtVis, a scalable data curation and distillation pipeline for multimodal reasoning. Starting from a diverse open-source seed pool, our pipeline distills structured CoT traces from a high-capacity teacher model under explicit format constraints. We then perform joint annotation of reasoning difficulty, answer quality, and semantic task tags, followed by rule-based filtering, difficulty-aware selection, and tag-based diversity sampling. This process yields a curated corpus of 1.8M samples and supports controllable subset construction for training smaller reasoning-capable MLLMs under practical compute and deployment constraints.
We evaluate distilled Qwen3-VL models ranging from 2B to 8B parameters on nine multimodal reasoning benchmarks. Distillation with OmniThoughtVis leads to consistent improvements across model scales, including gains of up to +16.8 points on MathVerse and +5.6 points on MMMU-Pro for the 4B model. Notably, the distilled 4B model matches or surpasses the undistilled 8B baseline on several tasks, suggesting a favorable quality-efficiency trade-off for deployment-oriented multimodal reasoning systems. To support reproducible research and practical development, we will release the curated dataset, data pipeline, and model checkpoints.
Our main contributions are as follows:
- We present OmniThoughtVis, a scalable pipeline for producing multimodal CoT supervision tailored to distilling reasoning capabilities into smaller, deployment-oriented MLLMs.
- We introduce a practical data curation recipe that combines structured teacher generation with joint annotation, difficulty-aware filtering, and diversity-aware sampling, enabling controllable reasoning data construction at million-sample scale.
- We show that OmniThoughtVis consistently improves distilled 2B–8B models across nine benchmarks, and we report practical observations on data selection and scaling behavior that are useful for building efficient multimodal reasoning systems.
## 2 Related Work
**Models with Extended Reasoning.** Chain-of-thought (CoT) prompting (Wei et al., [2022](https://arxiv.org/html/2605.11629#bib.bib24)) has shown that exposing intermediate reasoning steps can substantially improve model performance on complex tasks. Follow-up work has explored transferring such reasoning behavior into smaller models using teacher-generated rationales or step-by-step supervision (Hsieh et al., [2023](https://arxiv.org/html/2605.11629#bib.bib12); Ho et al., [2023](https://arxiv.org/html/2605.11629#bib.bib11); Mukherjee et al., [2023](https://arxiv.org/html/2605.11629#bib.bib19)). More recently, reinforcement learning with verifiable rewards has further strengthened long-form reasoning in frontier models, as illustrated by DeepSeek-R1 (DeepSeek-AI, [2025](https://arxiv.org/html/2605.11629#bib.bib7)) and OpenAI o1. In parallel, the community has begun to curate large-scale reasoning datasets for open models (Guha et al., [2026](https://arxiv.org/html/2605.11629#bib.bib9); Cai et al., [2025](https://arxiv.org/html/2605.11629#bib.bib3)). Our work is complementary to these efforts: rather than proposing a new reasoning algorithm, we focus on the practical problem of producing scalable multimodal reasoning supervision for distilling deployment-oriented MLLMs.
**Multimodal Dataset Curation and Reasoning Supervision.** A large body of prior work has improved multimodal models through instruction tuning and synthetic data generation. LLaVA (Liu et al., [2023](https://arxiv.org/html/2605.11629#bib.bib15)) introduced visual instruction tuning with GPT-generated conversations, but its supervision is primarily short-form and does not provide structured reasoning traces. ScienceQA (Lu et al., [2022](https://arxiv.org/html/2605.11629#bib.bib18)) includes CoT-style explanations, but its scale is limited and its domain is restricted to science questions. ShareGPT4V (Chen et al., [2024a](https://arxiv.org/html/2605.11629#bib.bib4)) scales up visual description data using GPT-4V, yet mainly targets perceptual grounding rather than multi-step reasoning. Cambrian-1 (Tong et al., [2024](https://arxiv.org/html/2605.11629#bib.bib20)) aggregates diverse vision-language resources for broad instruction tuning, but does not explicitly optimize for reasoning-oriented distillation into smaller models.
More recent open efforts such as BEE (Zhang et al., [2026](https://arxiv.org/html/2605.11629#bib.bib33)) and OpenMMReasoner (Zhang et al., [2025a](https://arxiv.org/html/2605.11629#bib.bib30)) have pushed multimodal data curation and reproducible training toward more capable reasoning models. Our work differs in emphasis. Rather than treating large open corpora as directly usable training data, we frame them as seed pools for building a distillation-oriented reasoning corpus under practical constraints. OmniThoughtVis adds three ingredients that are central to this goal: (1) structured CoT traces produced under explicit output constraints, (2) joint annotations of reasoning difficulty, answer quality, and semantic task tags, and (3) filtering and diversity-aware subset selection designed to improve reasoning transfer to smaller models. This combination enables finer control over training data composition and supports practical multimodal reasoning distillation at million-sample scale.
Figure 1: Overview of the OmniThoughtVis data curation and distillation pipeline. Starting from a broad open-source seed pool, we generate structured multimodal CoT traces, apply joint annotation and quality control, and construct training subsets for distilling smaller reasoning-capable MLLMs.
## 3 Methodology
We next describe the OmniThoughtVis pipeline for building a distillation-oriented multimodal reasoning corpus from open-source seed data. Figure [1](https://arxiv.org/html/2605.11629#S2.F1) provides an overview. Our goal is not simply to aggregate more multimodal samples, but to transform a broad seed pool into a curated corpus that supports controllable reasoning transfer to smaller, deployment-oriented MLLMs.
### 3.1 Seed Data Sampling
We start from FineVision (Wiedmann et al., [2025](https://arxiv.org/html/2605.11629#bib.bib25)), a large-scale aggregation of publicly available open-source datasets spanning visual question answering, chart understanding, mathematical reasoning, and general visual reasoning. In our pipeline, FineVision serves as a *seed pool* rather than a ready-to-train reasoning dataset. We sample from it using stratified sampling to preserve domain and task diversity, while capping each category at 20K samples to prevent a small number of high-frequency sources from dominating the candidate pool. This procedure yields an initial pool of 3.5M multimodal samples covering a broad range of domains and task types.
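For illustration, the capped stratified sampling step can be written as a short sketch. This is a minimal sketch under our own assumptions: each record is a dictionary carrying a `category` field, and the field name and grouping granularity are illustrative rather than the authors' implementation.

```python
import random
from collections import defaultdict

def stratified_sample(records, cap_per_category=20_000, seed=0):
    """Cap each category so that high-frequency sources cannot dominate the pool."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for record in records:
        by_category[record["category"]].append(record)  # "category" is an assumed field name

    pool = []
    for items in by_category.values():
        if len(items) > cap_per_category:
            items = rng.sample(items, cap_per_category)  # downsample dominant categories
        pool.extend(items)
    rng.shuffle(pool)
    return pool
```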
### 3.2 Distillation with Format Constraints
We use Qwen-VL-Max (Bai et al., [2025a](https://arxiv.org/html/2605.11629#bib.bib1)) as the teacher model to generate CoT traces for retained samples. Decoding is performed with temperature T=0.5, which we found to provide a practical balance between trace diversity and output stability, and the maximum generation length is set to 8192 tokens. To make the outputs easier to parse, score, and filter at scale, we enforce a structured XML-style format that explicitly separates the reasoning trace from the final answer, as shown below. Samples that do not satisfy the required format are removed during subsequent quality control. Detailed prompts are provided in Appendix [A.1](https://arxiv.org/html/2605.11629#A1.SS1).
`<think>Step-by-step reasoning trace</think><answer>Final answer</answer>`
This explicit structure serves two practical purposes. First, it enables robust downstream parsing for large-scale automated curation. Second, it supports separate handling of intermediate reasoning and final predictions during filtering, evaluation, and future reuse of the dataset.
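As a concrete illustration, a minimal parser for this format could look like the sketch below. The tag names follow the template above; the function name and the decision to drop malformed outputs silently are our own assumptions.

```python
import re

# Matches the <think>...</think><answer>...</answer> structure described above.
THINK_ANSWER = re.compile(
    r"<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)

def parse_trace(text):
    """Return (reasoning_trace, final_answer), or None when the output is malformed."""
    match = THINK_ANSWER.search(text)
    if match is None:
        return None  # malformed samples are removed during quality control
    return match.group("think").strip(), match.group("answer").strip()
```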
### 3.3 Joint Scoring and Semantic Tagging
A central component of our pipeline is the *joint* annotation of each generated sample along three axes: reasoning difficulty, answer quality, and semantic task tags. Given an image, instruction, and teacher response, we prompt a scorer model to output a JSON object containing these fields (see Appendices [A.2](https://arxiv.org/html/2605.11629#A1.SS2) and [A.3](https://arxiv.org/html/2605.11629#A1.SS3) for implementation details). Joint annotation reduces redundant inference passes and helps maintain consistency across annotations at million-sample scale.
**Difficulty (1–5):** This score reflects the cognitive complexity of the underlying task. Level 1 corresponds to simple recognition such as object presence, color, or shape identification. Level 2 covers basic counting or spatial relations. Level 3 involves moderate reasoning over actions or attributes. Level 4 denotes more challenging multi-step reasoning involving subtle visual cues or uncommon concepts. Level 5 represents abstract reasoning, complex scene understanding, or ambiguous contexts.
**Quality (1–5):** This score reflects the estimated correctness and completeness of the generated response. Level 1 denotes a fully incorrect or irrelevant answer, Level 3 indicates partial correctness, and Level 5 denotes a response judged to be accurate and complete.
**Semantic tags:** We additionally generate a set of task tags (e.g., *counting*, *spatial*, *reasoning*, *math*, *object*) to characterize the sample. The scorer outputs only a JSON object, enabling robust automated parsing and downstream selection. Frequent tags are visualized in Figure [2](https://arxiv.org/html/2605.11629#S3.F2).
We use Qwen3-VL-Flash as the scorer because annotation throughput is a practical bottleneck in million-sample curation, and this model offers a favorable trade-off between multimodal understanding quality and inference efficiency. This LLM-as-a-judge setup (Zheng et al., [2023](https://arxiv.org/html/2605.11629#bib.bib34)) provides a scalable way to attach metadata to generated traces. In our experiments, difficulty proved to be the most actionable signal for downstream data selection, while quality scores were more useful as an auxiliary inspection signal than as a strong standalone filter.
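To make the downstream selection step concrete, a hedged sketch of how such a joint annotation might be parsed and validated is shown below. The JSON field names (`difficulty`, `quality`, `tags`) are our own assumptions; the paper specifies only the three annotation axes and a JSON output format.

```python
import json

def parse_annotation(raw_json):
    """Validate one scorer output with difficulty, quality, and semantic task tags."""
    record = json.loads(raw_json)
    difficulty = int(record["difficulty"])  # assumed key; 1-5 scale per Appendix A.2
    quality = int(record["quality"])        # assumed key; 1-5 scale per Appendix A.2
    tags = [t.strip().lower() for t in record["tags"]]  # assumed key; open-ended task tags
    if not (1 <= difficulty <= 5 and 1 <= quality <= 5) or not tags:
        raise ValueError("annotation out of range or missing tags")
    return {"difficulty": difficulty, "quality": quality, "tags": tags}
```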
Figure 2: Word cloud visualization of the top 400 task-related labels in OmniThoughtVis. Frequent labels such as *reasoning*, *comparison*, *count*, *object*, and *scene* suggest broad semantic coverage across visual understanding and multimodal reasoning tasks.
### 3.4 Quality Control and Subset Construction
We apply quality control in three stages, followed by subset construction for training.
**Stage 1 (Rule-Based Filtering):** We remove samples that (i) lack the required output tags, (ii) have CoT traces shorter than 20 tokens or longer than 4000 tokens, (iii) contain placeholder text or strong repetition, or (iv) exhibit unstable generation patterns such as repeated restarts, unresolved contradictions, or abrupt self-corrections. We note that self-correction is not always undesirable; however, in our pipeline, these patterns often correlate with malformed or low-confidence traces that are difficult to parse and verify automatically. The 4000-token upper bound is used as a practical trade-off between preserving rich supervision and avoiding excessively long traces that often contain repetition or drift while substantially increasing training cost. Approximately 5% of samples are removed at this stage.
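A simplified version of these checks, assuming whitespace tokenization as a rough stand-in for the pipeline's tokenizer and with a purely illustrative repetition heuristic, might read:

```python
def passes_rule_filter(think, answer, min_tokens=20, max_tokens=4000):
    """Stage 1 sketch: drop malformed, too short/long, or highly repetitive traces."""
    if not think or not answer or not answer.strip():
        return False  # missing tags or empty fields
    tokens = think.split()  # rough proxy for the real tokenizer length
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False
    if len(set(tokens)) < 0.2 * len(tokens):
        return False  # crude repetition check: too few distinct tokens
    return True
```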
**Stage 2 (Difficulty-Aware Selection):** Rather than applying a fixed quality-only threshold, we compare multiple filtering strategies in Section [4.4](https://arxiv.org/html/2605.11629#S4.SS4). Based on these experiments, we adopt Difficulty ≥ 4 as the primary data selection criterion for the training subset. Empirically, this criterion retains more reasoning-intensive samples and yields better transfer than filtering by quality alone.
**Stage 3 (Tag-Based Diversity Sampling):** To improve coverage beyond dominant task categories, we perform diversity-aware subset selection using semantic tags. We embed tags using a text embedding model (e.g., Qwen3-Embedding (Zhang et al., [2025b](https://arxiv.org/html/2605.11629#bib.bib32))) and cluster them with DBSCAN (Ester et al., [1996](https://arxiv.org/html/2605.11629#bib.bib8)) to merge near-synonymous tags (e.g., *spatial* and *position*). Each sample is then represented by the average of its clustered tag embeddings, producing a coarse reasoning-profile vector. We apply farthest-point sampling in this space to construct a more diverse subset and reduce over-representation of frequent categories such as object-centric recognition. This tag space also enables domain-targeted curation: practitioners can filter or resample by tags (e.g., spatial reasoning or chart interpretation) without regenerating the full corpus.
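The Stage 3 procedure could be sketched roughly as below. We assume a generic `embed(texts)` helper that returns unit-norm vectors from a sentence-embedding model; the DBSCAN parameters and selection budget are placeholders rather than the values used in the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def diversity_sample(sample_tags, embed, n_select, eps=0.2):
    """Merge near-synonymous tags via DBSCAN, average tag embeddings per sample,
    then pick a diverse subset with greedy farthest-point sampling."""
    all_tags = sorted({t for tags in sample_tags for t in tags})
    tag_vecs = embed(all_tags)  # (num_tags, dim), assumed unit-norm
    labels = DBSCAN(eps=eps, min_samples=2, metric="cosine").fit_predict(tag_vecs)

    # Clustered tags are represented by their cluster centroid; noise tags (-1) keep their own vector.
    centroids = {lab: tag_vecs[labels == lab].mean(axis=0) for lab in set(labels) if lab != -1}
    tag_repr = {t: centroids.get(lab, vec) for t, lab, vec in zip(all_tags, labels, tag_vecs)}

    # Coarse reasoning-profile vector: mean of a sample's clustered tag embeddings.
    profiles = np.stack([np.mean([tag_repr[t] for t in tags], axis=0) for tags in sample_tags])

    # Greedy farthest-point sampling in the profile space.
    selected = [0]
    dists = np.linalg.norm(profiles - profiles[0], axis=1)
    while len(selected) < min(n_select, len(profiles)):
        idx = int(dists.argmax())
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(profiles - profiles[idx], axis=1))
    return selected
```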
**Training subset construction:** The full OmniThoughtVis corpus contains 1.8M curated samples. In this paper, we train on selected subsets under fixed compute budgets to enable controlled comparison across model scales. This setup reflects a practical deployment-oriented workflow in which a large candidate pool is curated once, and task- or budget-specific subsets are then selected for downstream training.


Figure 3: Difficulty (top) and quality (bottom) distributions in OmniThoughtVis. Difficulty spans a broad range, while quality is concentrated in the high-score regime, reflecting the generally strong teacher-generated responses in the retained corpus.
### 3.5 Dataset Statistics
Figure [3](https://arxiv.org/html/2605.11629#S3.F3) summarizes the key statistics of OmniThoughtVis. The difficulty distribution is approximately unimodal and centered at Level 3 (46.5%), with Levels 2 (23.6%) and 4 (29.6%) also well represented; extreme levels are rare (Level 1: 0.1%, Level 5: 0.2%). The quality distribution is strongly skewed toward high ratings, with 91.5% of samples assigned Level 5 and 6.4% assigned Level 4. This concentration suggests that quality scores are less discriminative than difficulty scores in our current setup, which is consistent with our later finding that difficulty is the more useful signal for training subset selection. Analysis of semantic tags confirms broad domain coverage, including *reasoning* (85%), *object* (75%), *scene* (45%), *count* (38%), *text* (37%), and *math* (33%). Overall, OmniThoughtVis is characterized not only by scale, but also by structured reasoning supervision and controllable metadata for downstream curation.
Table 1: Main results comparing Qwen3-VL baseline models and models fine-tuned on OmniThoughtVis. MMB = MMBench, Pro-S = MMMU-Pro Standard, and Pro-V = MMMU-Pro Vision. All values are percentages (%).
## 4 Experiments
We conduct experiments to evaluate whether OmniThoughtVis improves reasoning transfer to smaller multimodal models under a practical distillation setup. Our analysis focuses on three questions: (1) whether the curated corpus improves model performance across scales, (2) which data selection choices matter most in practice, and (3) how reasoning performance evolves during training.
### 4.1 Experimental Setup
**Training Protocol.** We initialize three models from the Qwen3-VL family (Bai et al., [2025a](https://arxiv.org/html/2605.11629#bib.bib1)) as student backbones: Qwen3-VL-2B, Qwen3-VL-4B, and Qwen3-VL-8B. We refer to the resulting distilled models as the DistilQwen3-VL series. Unless otherwise specified, all models are trained on a selected 0.5M-sample subset from the full 1.8M-sample OmniThoughtVis corpus. We use a fixed 0.5M subset to enable controlled comparison across model scales under a common compute budget; the larger corpus remains important as a candidate pool for difficulty-aware and tag-aware subset construction.
Training uses AdamW ($\beta_1=0.9$, $\beta_2=0.999$, weight decay 0.01), batch size 128, sequence length 3072, linear warmup over 3% of training steps, and cosine learning rate decay to 10% of the peak value. The original Qwen3-VL-Instruct models serve as baselines so that observed gains can be attributed to fine-tuning on OmniThoughtVis rather than changes in backbone architecture.
We perform a grid search over learning rates $\{5\times10^{-6}, 1\times10^{-5}, 2\times10^{-5}, 5\times10^{-5}\}$ and training epochs $\{1,2,3,4,5\}$ on the 2B model, evaluating all 20 configurations on a held-out 10K-sample split. Figure [4](https://arxiv.org/html/2605.11629#S4.F4) shows that the best setting uses a learning rate of $5\times10^{-6}$ for 4 epochs, which we adopt for all model scales. Higher learning rates generally lead to weaker performance, while additional epochs beyond the best setting do not provide consistent gains across learning rates.
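For reference, the stated schedule (linear warmup over 3% of steps, cosine decay to 10% of the peak learning rate) can be written down directly. This is a generic sketch of the described recipe, not the authors' training code.

```python
import math

def learning_rate(step, total_steps, peak_lr=5e-6, warmup_frac=0.03, floor_frac=0.10):
    """Linear warmup over 3% of steps, then cosine decay to 10% of the peak LR."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = floor_frac * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))
```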
**Benchmarks.** We evaluate on nine multimodal reasoning benchmarks. For *general visual understanding and knowledge-intensive reasoning*, we use AI2D (Kembhavi et al., [2016](https://arxiv.org/html/2605.11629#bib.bib13)), MMBench (Liu et al., [2024](https://arxiv.org/html/2605.11629#bib.bib16)), MMMU (Yue et al., [2024a](https://arxiv.org/html/2605.11629#bib.bib28)), MMMU-Pro (Standard and Vision) (Yue et al., [2024b](https://arxiv.org/html/2605.11629#bib.bib29)), and MMStar (Chen et al., [2024b](https://arxiv.org/html/2605.11629#bib.bib5)). For *mathematical reasoning*, we evaluate on MathVista (Lu et al., [2024](https://arxiv.org/html/2605.11629#bib.bib17)), MathVerse (Zhang et al., [2024](https://arxiv.org/html/2605.11629#bib.bib31)), and MathVision (Wang et al., [2024](https://arxiv.org/html/2605.11629#bib.bib23)). Together, these benchmarks cover tasks ranging from visual comprehension to multi-step multimodal inference.
**Evaluation Protocol.** All models are evaluated with thinking prompts (Appendix [A.1](https://arxiv.org/html/2605.11629#A1.SS1)) and decoding temperature 0.5. For multiple-choice benchmarks, we report exact-match accuracy after extracting predictions from `<answer>` tags. For open-ended tasks requiring LLM-based judging, we use Qwen3-VL-Plus. Because teacher generation, automatic scoring, and student backbones all involve Qwen-family models, some family-specific bias may remain; we therefore place primary emphasis on objective benchmark results and treat judge-based evaluation as a secondary signal.
### 4.2 Main Results
Table [1](https://arxiv.org/html/2605.11629#S3.T1) compares baseline Qwen3-VL models with their distilled counterparts trained on OmniThoughtVis. Distillation yields consistent gains across all three model scales. For the 4B model, improvements are especially pronounced on reasoning-intensive benchmarks such as MathVerse (+16.8) and MMMU-Pro (+5.6), suggesting that structured multimodal CoT supervision can effectively transfer reasoning behavior to smaller models. The 2B model shows similar gains, with MathVerse improving from 34.1 to 43.0 and MathVision from 20.4 to 27.3. The 8B model achieves the strongest absolute scores, reaching 83.7 on AI2D and 55.5 on MathVerse.
Notably, the distilled 4B model matches or surpasses the undistilled 8B baseline on several benchmarks (e.g., 81.6 vs. 81.0 on AI2D and 60.3 vs. 60.0 on MMMU). While these comparisons do not replace direct deployment measurements, they indicate a favorable quality–efficiency trade-off and suggest that scalable reasoning distillation can improve the usefulness of smaller models in deployment-oriented settings.
### 4.3 Detailed Analysis
**Hyperparameter Search.** We perform a grid search over learning rates and training epochs on the validation set. For DistilQwen3-VL-2B, we evaluate four learning rates ($5\times10^{-6}$, $1\times10^{-5}$, $2\times10^{-5}$, $5\times10^{-5}$) across 1–5 epochs, using MMMU-Pro Vision as the selection metric. As shown in Figure [4](https://arxiv.org/html/2605.11629#S4.F4), hyperparameter choice has a substantial effect. The best configuration uses a learning rate of $5\times10^{-6}$ for 4 epochs, achieving a score of 0.3220 compared with the 0.2130 baseline. Higher learning rates generally degrade performance, and the best performance is reached before the final epoch for the strongest low-learning-rate setting. We therefore use 4 epochs as the default setting in subsequent experiments.
Figure 4: Hyperparameter search on MMMU-Pro Vision for DistilQwen3-VL-2B. Baseline: 0.2130.

**Data Filtering Strategies.** We compare filtering by quality and/or difficulty under otherwise identical settings, with each model trained on approximately 100K samples (Table [2](https://arxiv.org/html/2605.11629#S4.T2)). Quality-only filtering (Quality ≥ 5) underperforms random sampling (0.2948 vs. 0.3093), whereas difficulty-only filtering (Difficulty ≥ 4) performs best (0.3156). Combining both criteria (Quality ≥ 5 & Difficulty ≥ 4) also improves over random sampling (0.3139), but remains slightly weaker than difficulty-only selection. These results suggest that difficulty is the more useful selection signal in our current pipeline, while quality scores are less discriminative for subset construction.
**Scaling Behavior During Training.** We further analyze training-time scaling by evaluating DistilQwen3-VL-8B across all nine benchmarks. Figure [5](https://arxiv.org/html/2605.11629#S4.F5) shows the average performance trajectory, while Figure [6](https://arxiv.org/html/2605.11629#S4.F6) reports per-benchmark trends. The average score increases from approximately 59.5 at 1K steps to 63.5 at 19K steps, indicating continued gains throughout training. These gains are not uniform across tasks: general visual understanding benchmarks such as AI2D and MMBench improve quickly and then plateau, whereas reasoning-intensive benchmarks such as MathVerse, MMMU, and MMMU-Pro continue to improve for longer.
This difference suggests that general understanding and reasoning transfer may saturate on different timescales. In practical terms, it indicates that uniform training schedules may not be optimal for multimodal reasoning distillation, and motivates future curriculum or stage-wise training strategies that allocate more training emphasis to reasoning-heavy data later in training.

Table 2: Ablation of data filtering strategies. All models are trained on approximately 100K samples.

Figure 5: Average performance across nine benchmarks during DistilQwen3-VL-8B training on OmniThoughtVis. Performance improves from 59.5 to 63.5 over 19K steps.

Figure 6: Per-benchmark scaling for DistilQwen3-VL-8B. General visual benchmarks plateau earlier, while reasoning-intensive benchmarks continue to improve.
### 4.4 Practical Insights
Our experiments suggest three practical lessons for large-scale multimodal reasoning distillation.

**Difficulty is a more actionable data-selection signal than quality.** In our current setup, quality scores are concentrated in the high-score regime and are therefore less useful for separating informative from uninformative training examples. By contrast, selecting more difficult samples consistently yields stronger reasoning transfer. This makes difficulty a more practical signal for subset construction in large-scale synthetic reasoning pipelines.

**Reasoning transfer and general understanding do not saturate at the same rate.** Scaling curves show that general visual understanding often improves early and then plateaus, while reasoning-heavy benchmarks continue to benefit from additional training. This suggests that multimodal distillation pipelines may benefit from stage-wise or curriculum-based scheduling rather than uniform training across all examples.

**A large curated pool is useful even when the final training subset is smaller.** Although our main experiments train on 0.5M selected samples, the full 1.8M-sample corpus remains valuable because it provides enough coverage for filtering, diversity-aware resampling, and future task-specific subset construction. In practical settings, curating a large candidate pool once and selecting smaller training subsets later can be more flexible than repeatedly regenerating task-specific data.
Due to space limitations, additional case studies are provided in Appendix [B](https://arxiv.org/html/2605.11629#A2).
## 5 Conclusion
We presented OmniThoughtVis, a scalable data curation and distillation pipeline for transferring multimodal reasoning capabilities into smaller, deployment-oriented MLLMs. Starting from a broad open-source seed pool, our pipeline combines structured CoT generation, joint annotation, and downstream subset selection to construct a 1.8M-sample reasoning corpus with controllable metadata. Our experiments show that this pipeline consistently improves distilled models across nine benchmarks, with especially strong gains on reasoning-intensive tasks and a favorable quality–efficiency trade-off for smaller models.
Beyond the released corpus itself, we view OmniThoughtVis as a practical recipe for large-scale multimodal reasoning distillation. In particular, our results suggest that difficulty-aware selection is more useful than quality-only filtering in the current setup, and that reasoning-heavy benchmarks may benefit from different training dynamics than general visual understanding tasks. We will release the curated dataset, pipeline, and model checkpoints to support reproducible research and the practical development of efficient multimodal reasoning systems.
## Limitations
While OmniThoughtVis covers a broad range of multimodal reasoning tasks and yields consistent gains in our experiments, several limitations remain. First, our current pipeline primarily focuses on English-language, vision-centric data. Extending the approach to additional languages and modalities would improve its generalizability. Second, the dataset is constructed through automated teacher generation and model-based scoring. Although this enables scalable curation, some generated reasoning traces or metadata may still contain noise or model-specific biases. Further validation with broader evaluation settings would strengthen confidence in the resulting corpus. Third, our experiments focus on benchmark-based evaluation under a fixed training budget. Additional studies on more diverse deployment scenarios and larger-scale training configurations would help further assess the practical utility of the pipeline. We leave these directions to future work.
## Ethical Considerations
OmniThoughtVis is constructed from publicly available, open-source vision-language datasets using automated generation, scoring, and filtering procedures. Although these steps are intended to improve data quality and task diversity, several ethical considerations remain. First, the dataset may inherit biases present in the source data or teacher models, which could affect the behavior of distilled models. Second, because the annotations are largely automated, some samples may contain inaccurate reasoning traces or metadata. Third, the dataset is primarily English-language and vision-centric, which may limit coverage across languages, domains, and populations. All source datasets were checked for license compatibility to the best of our knowledge, and we only include data intended for research use. We encourage responsible use and additional evaluation before applying models trained on OmniThoughtVis in downstream settings.
## References
- Bai et al\. \(2025a\)Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, and Ke Zhu\. 2025a\.[Qwen3\-vl technical report](http://arxiv.org/abs/2511.21631)\.
- Bai et al\. \(2025b\)Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin\. 2025b\.[Qwen2\.5\-vl technical report](http://arxiv.org/abs/2502.13923)\.
- Cai et al\. \(2025\)Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, and Xiangzhong Fang\. 2025\.Reasoning with omnithought: A large cot dataset with verbosity and cognitive difficulty annotations\.*arXiv preprint arXiv:2505\.10937*\.
- Chen et al\. \(2024a\)Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin\. 2024a\.[Sharegpt4v: Improving large multi\-modal models with better captions](https://doi.org/10.1007/978-3-031-72643-9_22)\.In*Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XVII*, pages 370–387, Berlin, Heidelberg\. Springer\-Verlag\.
- Chen et al\. \(2024b\)Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao\. 2024b\.[Are we on the right way for evaluating large vision\-language models?](https://openreview.net/forum?id=evP9mxNNxJ)In*The Thirty\-eighth Annual Conference on Neural Information Processing Systems*\.
- Chung et al\. \(2024\)Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al\. 2024\.Scaling instruction\-finetuned language models\.*Journal of Machine Learning Research*, 25\(70\):1–53\.
- DeepSeek\-AI \(2025\)DeepSeek\-AI\. 2025\.[Deepseek\-r1 incentivizes reasoning in llms through reinforcement learning](https://doi.org/10.1038/s41586-025-09422-z)\.*Nature*, 645\(8081\):633–638\.
- Ester et al\. \(1996\)Martin Ester, Hans\-Peter Kriegel, Jörg Sander, and Xiaowei Xu\. 1996\.A density\-based algorithm for discovering clusters in large spatial databases with noise\.In*Proceedings of the Second International Conference on Knowledge Discovery and Data Mining*, KDD’96, pages 226–231\. AAAI Press\.
- Guha et al\. \(2026\)Etash Kumar Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Rea Sprague, Ashima Suvarna, Benjamin Feuer, Leon Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik sharma, Charlie Cheng\-Jie Ji, Yichuan Deng, Sarah M Pratt, Vivek Ramanujan, Jon Saad\-Falcon, Stutee Acharya, Jeffrey Li, Achal Dave, Alon Albalak, Kushal Arora, Blake Wulfe, Chinmay Hegde, Greg Durrett, Sewoong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grover, Kai\-Wei Chang, Vaishaal Shankar, Aaron Gokaslan, Mike A Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alex Dimakis, and Ludwig Schmidt\. 2026\.[Openthoughts: Data recipes for reasoning models](https://openreview.net/forum?id=7xjoTuaNmN)\.In*The Fourteenth International Conference on Learning Representations*\.
- Hinton et al\. \(2015\)Geoffrey Hinton, Oriol Vinyals, and Jeff Dean\. 2015\.Distilling the knowledge in a neural network\.*arXiv preprint arXiv:1503\.02531*\.
- Ho et al\. \(2023\)Namgyu Ho, Laura Schmid, and Se\-Young Yun\. 2023\.[Large language models are reasoning teachers](https://doi.org/10.18653/v1/2023.acl-long.830)\.In*Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 14852–14882, Toronto, Canada\. Association for Computational Linguistics\.
- Hsieh et al\. \(2023\)Cheng\-Yu Hsieh, Chun\-Liang Li, Chih\-kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen\-Yu Lee, and Tomas Pfister\. 2023\.[Distilling step\-by\-step\! outperforming larger language models with less training data and smaller model sizes](https://doi.org/10.18653/v1/2023.findings-acl.507)\.In*Findings of the Association for Computational Linguistics: ACL 2023*, pages 8003–8017, Toronto, Canada\. Association for Computational Linguistics\.
- Kembhavi et al\. \(2016\)Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi\. 2016\.A diagram is worth a dozen images\.In*European conference on computer vision*, pages 235–251\. Springer\.
- Li et al\. \(2025\)Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li\. 2025\.[LLaVA\-onevision: Easy visual task transfer](https://openreview.net/forum?id=zKv8qULV6n)\.*Transactions on Machine Learning Research*\.
- Liu et al\. \(2023\)Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee\. 2023\.Visual instruction tuning\.*Advances in neural information processing systems*, 36:34892–34916\.
- Liu et al\. \(2024\)Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin\. 2024\.[Mmbench: Is your multi\-modal model an all\-around player?](https://doi.org/10.1007/978-3-031-72658-3_13)In*Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part VI*, pages 216–233, Berlin, Heidelberg\. Springer\-Verlag\.
- Lu et al\. \(2024\)Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai\-Wei Chang, Michel Galley, and Jianfeng Gao\. 2024\.[Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts](https://openreview.net/forum?id=KUNzEQMWU7)\.In*The Twelfth International Conference on Learning Representations*\.
- Lu et al\. \(2022\)Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai\-Wei Chang, Song\-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan\. 2022\.[Learn to explain: Multimodal reasoning via thought chains for science question answering](https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf)\.In*Advances in Neural Information Processing Systems*, volume 35, pages 2507–2521\. Curran Associates, Inc\.
- Mukherjee et al\. \(2023\)Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah\. 2023\.Orca: Progressive learning from complex explanation traces of gpt\-4\.*arXiv preprint arXiv:2306\.02707*\.
- Tong et al\. \(2024\)Shengbang Tong, Ellis L Brown II, Penghao Wu, Sanghyun Woo, ADITHYA JAIRAM IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, Xichen Pan, Rob Fergus, Yann LeCun, and Saining Xie\. 2024\.[Cambrian\-1: A fully open, vision\-centric exploration of multimodal LLMs](https://openreview.net/forum?id=Vi8AepAXGy)\.In*The Thirty\-eighth Annual Conference on Neural Information Processing Systems*\.
- Toshniwal et al\. \(2024\)Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, and Igor Gitman\. 2024\.Openmathinstruct\-1: A 1\.8 million math instruction tuning dataset\.*Advances in Neural Information Processing Systems*, 37:34737–34774\.
- Wang et al\. \(2025\)Chengyu Wang, Junbing Yan, Yuanhao Yue, and Jun Huang\. 2025\.Distilqwen2\.5: Industrial practices of training distilled open lightweight language models\.*arXiv preprint arXiv:2504\.15027*\.
- Wang et al\. \(2024\)Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li\. 2024\.Measuring multimodal mathematical reasoning with math\-vision dataset\.*Advances in Neural Information Processing Systems*, 37:95095–95169\.
- Wei et al\. \(2022\)Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al\. 2022\.Chain\-of\-thought prompting elicits reasoning in large language models\.*Advances in neural information processing systems*, 35:24824–24837\.
- Wiedmann et al\. \(2025\)Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, and Andrés Marafioti\. 2025\.Finevision: Open data is all you need\.*arXiv preprint arXiv:2510\.17269*\.
- Yang et al\. \(2025\)Chuanpeng Yang, Yao Zhu, Wang Lu, Yidong Wang, Qian Chen, Chenlong Gao, Bingjie Yan, and Yiqiang Chen\. 2025\.[Survey on knowledge distillation for large language models: Methods, evaluation, and application](https://doi.org/10.1145/3699518)\.*ACM Trans\. Intell\. Syst\. Technol\.*, 16\(6\)\.
- Yu et al\. \(2024\)Longhui Yu, Weisen Jiang, Han Shi, Jincheng YU, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu\. 2024\.[Metamath: Bootstrap your own mathematical questions for large language models](https://openreview.net/forum?id=N8N0hgNDRt)\.In*The Twelfth International Conference on Learning Representations*\.
- Yue et al\. \(2024a\)Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al\. 2024a\.Mmmu: A massive multi\-discipline multimodal understanding and reasoning benchmark for expert agi\.In*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pages 9556–9567\.
- Yue et al\. \(2024b\)Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig\. 2024b\.Mmmu\-pro: A more robust multi\-discipline multimodal understanding benchmark\.*arXiv preprint arXiv:2409\.02813*\.
- Zhang et al\. \(2025a\)Kaichen Zhang, Keming Wu, Zuhao Yang, Bo Li, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, and Lidong Bing\. 2025a\.Openmmreasoner: Pushing the frontiers for multimodal reasoning with an open and general recipe\.*arXiv preprint arXiv:2511\.16334*\.
- Zhang et al\. \(2024\)Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai\-Wei Chang, Yu Qiao, Peng Gao, and Hongsheng Li\. 2024\.[Mathverse: Does your multi\-modal llm truly see the diagrams in visual math problems?](https://doi.org/10.1007/978-3-031-73242-3_10)In*Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part VIII*, pages 169–186, Berlin, Heidelberg\. Springer\-Verlag\.
- Zhang et al\. \(2025b\)Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al\. 2025b\.Qwen3 embedding: Advancing text embedding and reranking through foundation models\.*arXiv preprint arXiv:2506\.05176*\.
- Zhang et al\. \(2026\)Yi Zhang, Bolin Ni, Xin\-Sheng Chen, Hengrui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng\-Hao Guo, and Shi min Hu\. 2026\.[Bee: A high\-quality corpus and full\-stack suite to unlock advanced fully open MLLMs](https://openreview.net/forum?id=IVluwK8q9q)\.In*The Fourteenth International Conference on Learning Representations*\.
- Zheng et al\. \(2023\)Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P\. Xing, Hao Zhang, Joseph E\. Gonzalez, and Ion Stoica\. 2023\.Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.In*Proceedings of the 37th International Conference on Neural Information Processing Systems*, NIPS ’23, Red Hook, NY, USA\. Curran Associates Inc\.
## Appendix A Technical Details
### A.1 Distillation Prompt Template
Below, we detail the standard prompt template used when collecting long-form CoT traces from high-capacity multimodal teacher models:
#### System Prompt.
The system prompt to the teacher is as follows:
> You are a helpful assistant to think step by step. Provide your reasoning steps within <think></think> tags and give your final answer within <answer></answer> tags.
#### Query Structure.
Given an image and a natural language question, the input to the teacher follows the structured format:
> \#\#\# Image
> <image>
> \#\#\# Question
> {question}
> \#\#\# Output Format (Strictly Enforced)
> <think>
> Clearly explain your reasoning step by step. Describe how you arrived at the conclusion. The reasoning process MUST BE enclosed within <think> </think> tags.
> </think>
> <answer>
> Your final answer to the user's question.
> </answer>
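A minimal helper that fills this template for one image-question pair might look as follows; the function name and exact whitespace are our own choices, and we assume the `<image>` placeholder is replaced by the serving framework rather than by this code.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant to think step by step. Provide your reasoning steps "
    "within <think></think> tags and give your final answer within <answer></answer> tags."
)

QUERY_TEMPLATE = """### Image
<image>
### Question
{question}
### Output Format (Strictly Enforced)
<think>
Clearly explain your reasoning step by step. Describe how you arrived at the conclusion.
The reasoning process MUST BE enclosed within <think> </think> tags.
</think>
<answer>
Your final answer to the user's question.
</answer>"""

def build_teacher_query(question: str) -> str:
    """Fill the distillation query template for a single image-question pair."""
    return QUERY_TEMPLATE.format(question=question)
```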
### A.2 Automatic Scoring Protocol
For data selection and curation, we use Qwen3-VL-Flash as an automatic scorer to evaluate each sample along two axes: reasoning difficulty and annotation quality. Each sample is also assigned a set of open-ended task tags; here we focus on the numeric scoring protocol.
#### Reasoning Difficulty.
Each sample is assigned a difficulty score $d_i \in \{1,2,3,4,5\}$ based on the complexity of the underlying vision-language reasoning required. The criteria are:
- **1 (Very Easy):** Object is plainly present; requires simple color or shape identification.
- **2 (Easy):** Involves basic counting of clearly visible items, or straightforward spatial relationships.
- **3 (Moderate):** Requires brief reasoning or identification of common actions or attributes.
- **4 (Hard):** Demands multi-step reasoning, attention to subtle visual cues, or handling rare concepts.
- **5 (Very Hard):** Involves abstract reasoning, complex scene understanding, or highly ambiguous context.
#### Annotation Quality.
Each annotated output is assigned a quality score $q_i \in \{1,2,3,4,5\}$ reflecting estimated correctness, relevance, and completeness, defined as follows:
- **1 (Very Low):** Entirely incorrect or irrelevant answer.
- **2 (Low):** Mostly incorrect, with only minor correct elements present.
- **3 (Medium):** Partially correct; key details are missing or incorrect elements are present.
- **4 (High):** Largely correct, with only minor inaccuracies or omissions.
- **5 (Very High):** Fully accurate, precise, and complete response.
These scoring guidelines provide a practical framework for large-scale multimodal data curation. In the main experiments, we find that difficulty is the more useful signal for subset selection, while quality is primarily used as an auxiliary indicator during inspection and filtering.
### A.3 Semantic Tag Definitions and Coverage
To support fine-grained subset construction and downstream analysis, each sample in the dataset is assigned one or more semantic tags reflecting its reasoning domain, visual theme, and task type. Statistical analysis of the tag distribution indicates that the curated corpus covers a broad range of domains and cognitive skills, including but not limited to:
- **Visual Understanding:** This includes object recognition (*object*), attribute parsing (*attribute*), and holistic scene comprehension (*scene*).
- **Spatial Reasoning:** Tags represent positional and relational understanding such as spatial relations (*spatial*, *position*, *layout*), directionality (*direction*), and coordinate-based inference (*coordinate*).
- **Chart and Table Interpretation:** We support multiple visualization formats, reflected in tags such as *chart*, *graph*, *table*, *bar chart*, *pie chart*, and *scatter plot*, as well as related meta-information (*axis*, *legend*, *label*, *caption*).
- **Logical and Mathematical Reasoning:** The dataset includes tags for counting (*count*), comparison (*comparison*), and broader mathematical and statistical reasoning (*math*, *algebra*, *geometry*, *percentage*, *ratio*, *average*).
- **Domain-Specific Knowledge:** Coverage extends to science (*physics*, *chemistry*, *biology*), technology (*code*, *algorithm*, *electronics*), and humanities (*history*, *politics*, *literature*).
- **Reflective and Error-Aware Reasoning:** Tags emphasize verification (*verification*, *fact-check*), error identification (*error*, *discrepancy*), and attention to detail (*precision*, *detail*).

These semantic tags support systematic sampling, targeted curriculum design, and analysis across diverse reasoning tasks and domains.
## Appendix B Case Study
We qualitatively compare source annotations from FineVision with the structured annotations produced by OmniThoughtVis. These examples are intended to illustrate how structured reasoning traces can make supervision more auditable, explicit, and useful for downstream distillation. In many cases, the source annotations provide concise conclusions or coarse descriptions, while our pipeline decomposes the task into explicit reasoning steps (e.g., element identification, subgoal reasoning, evidence checking, and exception identification). This structure can improve interpretability and make it easier to inspect or reuse the resulting supervision.

**Case 1 (Scientific concept grounding).** The FineVision annotation gives a brief justification for "ionic" bonding, but remains relatively coarse. Our annotation follows a structured checklist (entity identification → bonding-type analysis → dominance judgment → structural verification), explicitly connecting the final answer to observable cues such as metal–oxygen composition, charged oxygen, and coordination pattern.

**Case 2 (Geometry problem; error correction).** The FineVision annotation outputs an option letter (A) without intermediate verification. Our structured annotation decomposes the solution into geometry subgoals (triangle properties → perpendicular-from-center theorem → angle/arc relationships), providing a traceable derivation that also recovers the correct answer.

**Case 3 (Meme understanding with multi-dimensional affect).** The FineVision annotation summarizes the meme with coarse labels (e.g., "funny" and "positive"). Our annotation instead allocates one reasoning step to each requested dimension (humor, sarcasm, offensiveness, motivational quality, and sentiment), tying the judgment more explicitly to textual and visual evidence.

**Case 4 (Chart reasoning; error correction).** The FineVision annotation answers "Yes" without supporting comparison. Our annotation performs a more explicit curve-by-curve analysis followed by an overall comparison, making the visual evidence behind the answer easier to verify.

**Case 5 (Fine-grained visual search in a cluttered scene).** The FineVision annotation points to a single candidate using a brief description. Our annotation adopts an enumerate-then-identify strategy: it first lists visible individuals and their helmet status, and only then identifies the unique exception. This yields more explicit evidence coverage for "find-the-exception" tasks.
**Case 1** ![Uncaptioned image](https://arxiv.org/html/2605.11629v1/fig/case/0213case1.png)

**Question:** Does this compound exhibit ionic or covalent bonding?

**Original Annotation (FineVision):** The interactions between zirconium and oxygen in this structure indicate ionic bonding characteristics, particularly due to the presence of charged oxygen. Answer: ionic

**Structured Annotation (Ours):**

<think>
To determine whether the compound exhibits ionic or covalent bonding, we need to analyze the nature of the bonds between the atoms in the structure.
Step 1: Identify the elements involved. The structure contains zirconium (Zr), yttrium (Y), and oxygen (O). Oxygen is highly electronegative, while Zr and Y are metals.
Step 2: Analyze metal–oxygen bonding. Zr and Y are transition metals bonded to oxygen atoms. Oxygen typically forms anions (O-) in such compounds. The presence of O- ions suggests electrons are transferred from the metals to oxygen, forming ionic bonds.
Step 3: Evaluate the dominant bonding type. In oxides of metals like Zr and Y, the bonding is primarily ionic due to the large difference in electronegativity. Although some covalent character may exist due to polarization, the dominant bonding type remains ionic.
Step 4: Structural verification. The structure shows Zr and Y each bonded to four O- ions, typical for ionic oxides. The arrangement resembles a perovskite-like structure with metal cations surrounded by oxygen anions, consistent with ionic bonding.
Conclusion: Given the involvement of metals (Zr, Y) and oxygen, and the presence of O- ions, the bonding is predominantly ionic.
</think>
<answer>Ionic bonding</answer>

**Case 2** ![Uncaptioned image](https://arxiv.org/html/2605.11629v1/fig/case/0213case2.png)

**Question:** Answer the mathematical geometry problem and directly provide the correct option letter. Circle $O$ has a radius of 10, $AB=10$ and $m\overset{\frown}{AB}=60$. Find $m\overset{\frown}{AX}$. Choices: A: 5, B: 10, C: 20, D: 30

**Original Annotation (FineVision):** A

**Structured Annotation (Ours):**

<think>
We are given: circle $O$ has a radius of 10, $AB=10$, and $m\overset{\frown}{AB}=60^{\circ}$. We need to find $m\overset{\frown}{AX}$.
Step 1: Analyze triangle $OAB$. Since $AB$ is a chord and $OA=OB=10$ (radii), and $AB=10$, triangle $OAB$ has all sides equal to 10, so it is an equilateral triangle. Therefore $\angle AOB=60^{\circ}$, consistent with $m\overset{\frown}{AB}=60^{\circ}$.
Step 2: Identify point $X$. Point $X$ is the foot of the perpendicular from center $O$ to chord $AB$. By the perpendicular-from-center theorem, $OX\perp AB$ bisects the chord, so $AX=XB=AB/2=5$.
Step 3: Compute the arc. Since $OX$ bisects both chord $AB$ and central angle $\angle AOB$, we have $\angle AOX=\angle BOX=60^{\circ}/2=30^{\circ}$. Arc $AX$ corresponds to central angle $\angle AOX$, so $m\overset{\frown}{AX}=30^{\circ}$.
Conclusion: $m\overset{\frown}{AX}=30$. The correct answer is D.
</think>
<answer>D</answer>

**Case 3** ![Uncaptioned image](https://arxiv.org/html/2605.11629v1/fig/case/0213case3.png)

**Question:** Break down this meme's emotional profile: humor level, sarcasm, offensiveness, motivational qualities, and sentiment.

**Original Annotation (FineVision):** This meme is funny. It is generally sarcastic, not offensive, and not motivational. The overall sentiment is positive.

**Structured Annotation (Ours):**

<think>
To analyze the emotional profile of this meme, let's break it down step by step:
1. Humor Level. The humor stems from relatability and irony. Many students have experienced the frustration of being unable to recall examples during an exam and resorting to writing "etc." as a placeholder. The image of Draco Malfoy, known for his smug demeanor, exaggerates this behavior. This contrast between a common academic struggle and a character known for superiority creates a light-hearted, situational humor.
2. Sarcasm. There is a strong element of sarcasm. The phrase "LIKE A BOSS" is typically used to mock someone acting overly confident. Here, it is applied to a student who pretends to be knowledgeable by using "etc.," highlighting the absurdity of feigning expertise.
3. Offensiveness. The meme is not offensive. It does not target any specific group, individual, or sensitive topic. Instead, it pokes fun at a universal student experience in a non-malicious way.
4. Motivational Qualities. The meme lacks motivational qualities. It mocks a common academic shortcut rather than inspiring positive behavior, making the audience laugh at their own past struggles.
5. Sentiment. The overall sentiment is negative but in a playful, ironic way. It expresses frustration with academic limitations through humor and exaggeration. The tone is self-deprecating and slightly mocking, but not mean-spirited.
Summary: The meme uses relatable academic anxiety, exaggerated confidence, and sarcasm to create humor without crossing into offensiveness.
</think>
<answer>Humor Level: High (relatable, ironic, and exaggerated); Sarcasm: High (mocking the pretense of knowledge); Offensiveness: Low (no targeted insults or sensitive topics); Motivational Qualities: None (mocks rather than inspires); Sentiment: Playful negativity (self-deprecating humor with irony)</answer>

**Case 4** ![Uncaptioned image](https://arxiv.org/html/2605.11629v1/fig/case/0213case4.png)

**Question:** Does Forest Green have the maximum area under the curve? Answer yes or no.

**Original Annotation (FineVision):** Yes.

**Structured Annotation (Ours):**

<think>
To determine whether Forest Green has the maximum area under the curve, we need to compare the areas under all lines in the graph. The area under a curve is proportional to how high the line stays above the x-axis over the entire range.
Step 1: Analyze each line. Cyan starts at the highest value (~98) and decreases steadily but remains relatively high throughout. Light Slate also starts high and decreases gradually, staying above most other lines except Cyan. Olive starts high but drops sharply after the first few points. Peru remains fairly flat around 78–79. Gold starts low and increases slowly. Light Gold starts low and increases steadily, crossing some lines but not reaching the top. Forest Green starts around 75, peaks slightly, then drops sharply and ends near zero.
Step 2: Compare areas. Cyan and Light Slate are the two highest lines throughout the range. Cyan starts higher and maintains a higher position for most of the range. Forest Green, while starting moderately high, drops quickly and ends near zero, so its area is significantly smaller.
Conclusion: Forest Green does *not* have the maximum area under the curve. Cyan has the largest area due to its consistently high values across the entire range.
</think>
<answer>No</answer>

**Case 5** ![Uncaptioned image](https://arxiv.org/html/2605.11629v1/fig/case/0213case5.png)

**Question:** Point out the person in the picture who is not wearing a helmet.

**Original Annotation (FineVision):** In the image, there is a person wearing a pink shirt and dark pants, standing near the center of the image to the right of a blue truck, who is not wearing a helmet. This individual is in contrast to the others around who are wearing helmets, presumably for safety reasons at this construction or work site.

**Structured Annotation (Ours):**

<think>
To determine who is not wearing a helmet, I will examine each person in the image.
Step 1: Enumerate all individuals. (1) Person on the left side of the blue truck: orange outfit, blue helmet. (2) Person next to them: orange outfit, blue helmet. (3) Person in the center-left: red outfit, blue helmet. (4) Person in the center-right: dark outfit, blue helmet. (5) Person near the bottom center: gray shirt, black pants, no helmet; holding a measuring tape. (6) Person near the right side under the makeshift shelter: blue helmet. (7) Person near the far right, partially obscured: blue helmet.
Step 2: Identify the exception. After reviewing all individuals, only person (5) in the gray shirt is not wearing a helmet.
</think>
<answer>The person in the gray shirt near the bottom center of the image is not wearing a helmet.</answer>