GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
May 15, 2026
Authors: Junho Kim, Xu Cao, Houze Yang, Bikram Boote, Ana Jojic, Fiona Ryan, Bolin Lai, Sangmin Lee, James M. Rehg
cs.AI
Abstract
Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.