Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

Papers with Code Trending Papers

Summary

The paper introduces PRISM, a method that inserts a distribution-alignment stage between supervised fine-tuning and reinforcement learning to mitigate distributional drift in multimodal models. It uses a black-box adversarial game with an MoE discriminator to improve RLVR performance on models like Qwen3-VL.

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher-fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step-by-step reasoning on the hardest unsolved problems. Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, improving average accuracy by +4.4 and +6.0 points over the SFT-to-RLVR baseline on 4B and 8B, respectively. Our code, data, and model checkpoints are publicly available at https://github.com/XIAO4579/PRISM.

Cached at: 05/08/26, 08:46 AM


Source: https://huggingface.co/papers/2604.28123

Abstract

PRISM addresses distributional drift in multimodal models by inserting a distribution-alignment stage between supervised fine-tuning and reinforcement learning with verifiable rewards. The stage uses a black-box adversarial game between the policy and an MoE discriminator to provide disentangled corrective signals.

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model’s original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher-fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step-by-step reasoning on the hardest unsolved problems. Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, improving average accuracy by +4.4 and +6.0 points over the SFT-to-RLVR baseline on 4B and 8B, respectively. Our code, data, and model checkpoints are publicly available at https://github.com/XIAO4579/PRISM.
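To make the response-level adversarial game more concrete, here is a toy sketch in Python. It is not the PRISM implementation: the class `TwoExpertDiscriminator`, the two-number feature vectors, the logistic experts, the mean-of-experts reward, and the mean-centering step are all assumptions chosen for illustration. The sketch only shows the general shape of the idea: a discriminator with separate perception and reasoning experts learns to tell supervision demonstrations from on-policy samples, and its per-expert scores on the policy's own responses serve as a reward signal that never touches teacher logits. The policy update itself is omitted.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical per-response features: (perception features, reasoning features).
# Each vector is [bias, quality], where "quality" is a made-up scalar in [0, 1].
Features = Tuple[List[float], List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(w: List[float], x: List[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


@dataclass
class TwoExpertDiscriminator:
    """Toy stand-in for an MoE discriminator: one logistic expert per aspect."""
    w_perception: List[float]
    w_reasoning: List[float]

    def scores(self, feats: Features) -> Tuple[float, float]:
        """Per-expert probability that a response comes from the supervision data."""
        perc, reas = feats
        return (sigmoid(dot(self.w_perception, perc)),
                sigmoid(dot(self.w_reasoning, reas)))

    def step(self, feats: Features, is_teacher: bool, lr: float = 0.5) -> None:
        """One gradient step on binary cross-entropy for each expert."""
        target = 1.0 if is_teacher else 0.0
        p, r = self.scores(feats)
        perc, reas = feats
        self.w_perception = [w + lr * (target - p) * x
                             for w, x in zip(self.w_perception, perc)]
        self.w_reasoning = [w + lr * (target - r) * x
                            for w, x in zip(self.w_reasoning, reas)]


def policy_rewards(disc: TwoExpertDiscriminator, samples: List[Features]) -> List[float]:
    """Response-level reward: mean of the two expert scores (an assumed combination)."""
    return [sum(disc.scores(f)) / 2.0 for f in samples]


def centered(values: List[float]) -> List[float]:
    """Subtract the group mean so rewards act like simple advantages."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Made-up data: demonstrations are strong on both aspects; the first policy
# sample is weak on perception, the second is weak on reasoning.
teacher_demos = [([1.0, 0.9], [1.0, 0.8]), ([1.0, 0.8], [1.0, 0.9])]
policy_samples = [([1.0, 0.2], [1.0, 0.7]), ([1.0, 0.7], [1.0, 0.1])]

disc = TwoExpertDiscriminator(w_perception=[0.0, 0.0], w_reasoning=[0.0, 0.0])
for _ in range(200):
    for feats in teacher_demos:
        disc.step(feats, is_teacher=True)
    for feats in policy_samples:
        disc.step(feats, is_teacher=False)

expert_scores = [disc.scores(f) for f in policy_samples]
rewards = policy_rewards(disc, policy_samples)
advantages = centered(rewards)
print("per-expert scores:", expert_scores)
print("centered rewards:", advantages)
```

In this toy setup the perception expert scores the first policy sample lower and the reasoning expert scores the second one lower, which is the kind of disentangled signal the abstract describes. A full alternating game would also update the policy against these rewards and keep drawing fresh on-policy samples for the discriminator.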


Get this paper in your agent:

hf papers read 2604.28123

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper (6)

prism-vlm/Qwen3-VL-4B-Instruct-SFT-PRISM-GRPO • Image-Text-to-Text • 5B • Updated 2 days ago • 41
prism-vlm/Qwen3-VL-4B-Instruct-SFT-PRISM-GSPO • Image-Text-to-Text • 5B • Updated 2 days ago • 42
prism-vlm/Qwen3-VL-4B-Instruct-SFT-PRISM-DAPO • Image-Text-to-Text • 5B • Updated 2 days ago • 53
prism-vlm/Qwen3-VL-8B-Instruct-SFT-PRISM-GSPO • Image-Text-to-Text • 9B • Updated 2 days ago • 35

Datasets citing this paper (3)

prism-vlm/gemini_public_mmr1 • Viewer • Updated 2 days ago • 1.27M • 408 • 2
prism-vlm/gemini_distill • Viewer • Updated 2 days ago • 108k • 210
prism-vlm/rl_dataset • Viewer • Updated 2 days ago • 20.3k • 48


Similar Articles

RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training

arXiv cs.LG

Harvard researchers challenge the standard LLM training pipeline by showing RL can be effectively applied during pre-training rather than only after SFT, finding that data composition matters more than model scale, and proposing parallel averaging of RL and SFT objectives that outperforms sequential approaches while preserving general capabilities.

Structured Role-Aware Policy Optimization for Multimodal Reasoning

arXiv cs.AI

This paper introduces Structured Role-Aware Policy Optimization (SRPO), a method that improves multimodal reasoning in Large Vision-Language Models by assigning token-level credit based on distinct perception and reasoning roles within reinforcement learning frameworks.