Tag
Introduces Perceive-to-Reason (P2R), a framework that decouples visual perception from reasoning in vision-language models using a two-stage process and a role-aware reinforcement learning strategy, achieving state-of-the-art results on fine-grained visual reasoning benchmarks.
V-Zero is a label-free framework for fine-grained visual reasoning that uses contrastive evidence gating and on-policy distillation to improve performance without annotated answer labels, while training faster than traditional methods.