Tag
Proposes a reinforcement learning-based post-training method for thinking-based multimodal large language models that combines Group Relative Policy Optimization (GRPO) with chain-of-thought supervision to improve both classification and explanation quality in hateful and propagandistic meme detection, with gains on the Hateful Memes and ArMeme benchmarks.