GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

Hugging Face Daily Papers

Summary

GRASP is a large-scale dataset for social reasoning in multi-person videos, connecting high-level social questions with fine-grained gaze and gesture events, and introduces Social Grounding Reward to improve multimodal model understanding.

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.
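
For a rough sense of scale, the aggregate figures in the abstract can be turned into per-video averages. The sketch below is only back-of-the-envelope arithmetic. The counts are the rounded values quoted above, the function and key names are illustrative rather than taken from the paper, and the abstract does not describe how clip lengths or question counts are distributed across videos.

```python
# Rounded figures quoted in the abstract; the derived averages are approximate.
QA_PAIRS = 290_000
VIDEOS = 46_000
TOTAL_HOURS = 749


def dataset_averages(qa_pairs: int, videos: int, total_hours: float) -> dict:
    """Return rough averages derived from the aggregate counts."""
    return {
        "qa_per_video": qa_pairs / videos,
        "seconds_per_video": total_hours * 3600 / videos,
        "qa_per_hour": qa_pairs / total_hours,
    }


averages = dataset_averages(QA_PAIRS, VIDEOS, TOTAL_HOURS)
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the averages come out to about 6.3 question-answer pairs per video, clips of just under one minute, and roughly 387 question-answer pairs per hour of footage.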

Cached at: 05/19/26, 10:31 AM

Source: https://huggingface.co/papers/2605.15764


