SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
March 17, 2026
Authors: Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji
cs.AI
Abstract
Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity: the fundamental capacity to navigate dynamic social cues in natural dialogue. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. We benchmark 12 leading OLMs and uncover significant variance in their social-interaction capabilities. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. Encouragingly, SocialOmni's diagnostics yield actionable signals for bridging the perception-interaction divide in future OLMs.
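To make the three evaluation dimensions concrete, the sketch below shows what a SocialOmni-style instance and scoring loop might look like. This is a minimal illustration under stated assumptions: the names `Instance`, `Task`, `score`, the field layout, and the file paths are all hypothetical and are not the benchmark's published schema.

```python
# Hypothetical sketch of a SocialOmni-style evaluation record and scorer.
# All class names, fields, and paths are illustrative assumptions; the
# paper does not publish this schema here.
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Task(Enum):
    SPEAKER_ID = "who"          # speaker separation and identification
    INTERRUPT_TIMING = "when"   # interruption timing control
    INTERRUPT_GEN = "how"       # natural interruption generation


@dataclass
class Instance:
    task: Task
    video_path: str                            # visual stream of the dialogue
    audio_path: str                            # corresponding audio track
    gold_speaker: Optional[str] = None         # label for SPEAKER_ID
    gold_window: Optional[Tuple[float, float]] = None  # (start_s, end_s) for timing
    reference_text: Optional[str] = None       # reference interjection for generation


def score(instance: Instance, prediction) -> float:
    """Toy scorer: exact match for perception tasks. Judging generation
    quality would in practice require human or model-based evaluation."""
    if instance.task is Task.SPEAKER_ID:
        return float(prediction == instance.gold_speaker)
    if instance.task is Task.INTERRUPT_TIMING:
        start, end = instance.gold_window
        return float(start <= prediction <= end)  # prediction is a timestamp in seconds
    # INTERRUPT_GEN: placeholder; real evaluation of open-ended output
    # cannot be reduced to a binary check.
    return float(bool(prediction))


if __name__ == "__main__":
    ex = Instance(
        task=Task.INTERRUPT_TIMING,
        video_path="clip_0001.mp4",
        audio_path="clip_0001.wav",
        gold_window=(12.4, 14.0),
    )
    print(score(ex, 13.1))  # -> 1.0: the predicted interjection time falls in the window
```

A schema along these lines would make the paper's reported decoupling directly measurable: a model can score well on the two perception tasks (who, when) while still producing poor open-ended interjections (how).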