

SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

March 17, 2026
Authors: Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji
cs.AI

Abstract

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity: the fundamental capacity to navigate dynamic cues in natural dialogue. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios that test model robustness. We benchmark 12 leading OLMs and uncover significant variance in their social-interaction capabilities. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.
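The decoupling claim above can be made concrete with a small analysis sketch: collect each model's perception accuracy and its interruption-generation score, then check how weakly the two track each other. This is a hypothetical illustration only; the `ModelResult` fields, model names, and all numbers below are invented for demonstration and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class ModelResult:
    name: str
    perception_acc: float       # accuracy on perception-style tasks
    interruption_score: float   # judged quality of generated interruptions

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Illustrative numbers (not from the paper): a model with the highest
# perception accuracy need not produce the best-timed, best-phrased
# interruptions, which is the decoupling the abstract describes.
results = [
    ModelResult("model_a", 0.82, 0.41),
    ModelResult("model_b", 0.75, 0.63),
    ModelResult("model_c", 0.88, 0.38),
]
r = pearson([m.perception_acc for m in results],
            [m.interruption_score for m in results])
print(f"perception vs. generation correlation: {r:.2f}")
```

A weak or negative correlation in such a table is what motivates reporting the two dimensions separately rather than folding them into a single understanding-centric score.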