GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

Abstract

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who is interacting with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, and is accompanied by GRASP-Bench for evaluation. Unlike prior resources, which focus either on isolated cues or on high-level social QA, GRASP builds its questions from identity-consistent gaze trajectories, deictic gestures, and their joint composition into social events. We further propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.