Tag
SAGA framework uses frozen multimodal large language models to provide attribute-aware supervision for vision encoders via Group Relative Policy Optimization, improving zero-shot image retrieval by 3–6 points on fine-grained benchmarks.