SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Abstract

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogue. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios that test model robustness. We benchmark 12 leading OLMs and find significant variance in social-interaction capabilities across models. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.
