OmniToM introduces a benchmark that evaluates large language models' theory of mind by requiring explicit belief structure extraction and labeling, revealing a bottleneck in tracking actor-specific beliefs despite strong performance on endpoint QA tasks.
GRASP is a large-scale dataset for social reasoning in multi-person videos, connecting high-level social questions with fine-grained gaze and gesture events, and introduces Social Grounding Reward to improve multimodal model understanding.
RoleConflictBench is a benchmark of over 13,000 scenarios across 65 roles, designed to evaluate how sensitive LLMs are to context in role-conflict situations, where multiple social expectations clash. An analysis of 10 LLMs shows that models rely predominantly on learned role preferences rather than dynamic contextual cues when making decisions.