GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

Abstract

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who is interacting with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning. We also provide GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds its questions from identity-consistent gaze trajectories, deictic gestures, and the composition of the two into social events. We further propose the Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.