GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

Abstract

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social question answering (QA) with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, and is accompanied by GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.