Tag
The paper introduces PRISM, a method that inserts a distribution-alignment stage between supervised fine-tuning and reinforcement learning to mitigate distributional drift in multimodal models. It uses a black-box adversarial game with an MoE discriminator to improve RLVR performance on models like Qwen3-VL.