GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
Summary
GRASP is a large-scale dataset for social reasoning in multi-person videos, connecting high-level social questions with fine-grained gaze and gesture events, and introduces Social Grounding Reward to improve multimodal model understanding.
Cached at: 05/19/26, 10:31 AM
Source: https://huggingface.co/papers/2605.15764
Abstract
GRASP is a large-scale social reasoning dataset connecting high-level social questions with fine-grained gaze and gesture events, along with Social Grounding Reward to improve multimodal model understanding of social interactions.
Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.
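To give a rough sense of scale, the headline figures in the abstract can be turned into per-video averages. The short Python sketch below does only that arithmetic. It assumes the rounded values quoted above (290K, 46K, 749 hours) as inputs, so the results are approximations. The function and constant names are illustrative and do not come from the paper.

```python
# Rounded figures quoted in the abstract; exact counts are not given there.
NUM_QA_PAIRS = 290_000   # "290K question-answer pairs"
NUM_VIDEOS = 46_000      # "46K videos"
TOTAL_HOURS = 749        # "749 hours"


def dataset_averages(num_qa, num_videos, total_hours):
    """Return (QA pairs per video, mean video length in seconds)."""
    qa_per_video = num_qa / num_videos
    seconds_per_video = total_hours * 3600 / num_videos
    return qa_per_video, seconds_per_video


qa_per_video, seconds_per_video = dataset_averages(
    NUM_QA_PAIRS, NUM_VIDEOS, TOTAL_HOURS
)
print(f"about {qa_per_video:.1f} QA pairs per video")
print(f"about {seconds_per_video:.0f} seconds per video on average")
```

Under these rounded inputs, the averages come out to roughly six QA pairs per video and clips of about one minute each. This suggests short clips with several questions apiece, though the actual distribution is not stated in the abstract.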
Similar Articles
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
This paper introduces RIS, a framework for spatial-semantic grounded latent visual reasoning in Multimodal Large Language Models to overcome information bottlenecks. It proposes anchoring latent tokens to spatial and semantic evidence, showing improvements on benchmarks like V* and HRBench.
A Very Big Video Reasoning Suite
This paper introduces the Very Big Video Reasoning (VBVR) dataset and benchmark, a large-scale resource with over one million video clips across 200 reasoning tasks, enabling systematic study of spatiotemporal reasoning and showing early signs of emergent generalization.
GraphReAct: Reasoning and Acting for Multi-step Graph Inference
This paper introduces GraphReAct, a framework that extends reasoning-acting paradigms to graph-structured data for multi-step inference. It combines topological and semantic retrieval with context refinement to improve performance on graph learning benchmarks.
iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning
Introduces iVGR, a reinforcement learning framework that internalizes visual localization into textual reasoning for multimodal language models, eliminating the need for explicit visual grounding during inference while improving fine-grained perception performance.
Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning
This paper proposes MARS, a mono-anchored multi-source reasoning framework that uses dynamic anchors to quantify information gain and regulate modality interactions during reinforcement learning with verifiable rewards, achieving 3.2% and 4.9% performance gains on GRPO and DAPO across diverse datasets.