SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Abstract

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogue. To address this gap, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of conversational interactivity along three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni comprises 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios that test model robustness. We benchmark 12 leading OLMs and uncover significant variance in their social-interaction capabilities. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. Encouragingly, these diagnostics yield actionable signals for bridging the perception-interaction divide in future OLMs.
