Researchers introduce Groupwise Ranking Reward to fix reasoning-answer inconsistency in multimodal RL, raising reliability-conditioned accuracy from 47.4% under standard RLVR to 54.7%.