Omni-DuplexEval: Evaluating Real-Time Duplex Omni-Modal Interaction

Abstract


Real-time duplex interaction is essential for multimodal AI systems operating in real-world scenarios, where models must continuously process streaming inputs and respond at appropriate moments. However, most existing multimodal large language models (MLLMs) are evaluated in offline settings, where the entire video input is processed before any response is generated. Although recent work has begun to explore real-time duplex MLLMs, there is still no comprehensive benchmark or automatic evaluation method for this setting. To address this gap, we propose Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction. The benchmark consists of two complementary scenarios: (1) Real-Time Description, which evaluates the ability to generate continuous, time-aligned responses that track evolving multimodal inputs, and (2) Proactive Reminder, which evaluates the ability to identify salient events and respond at appropriate moments. Omni-DuplexEval contains 660 videos with fine-grained, human-annotated labels and precise temporal metadata, spanning 9 tasks grounded in real-world scenarios; all questions are formulated as open-ended queries. We further introduce an automatic evaluation framework based on LLM-as-a-Judge, which jointly evaluates response-content alignment and response timing through timestamp-aware, sequential reasoning and achieves strong agreement with human judgments. Experiments on state-of-the-art duplex MLLMs reveal substantial limitations: the best-performing model achieves only 39.6% overall and 20.0% on Proactive Reminder. Our analysis identifies two key challenges: models struggle to balance timely responses with coherent, holistic content generation, and they often fail to jointly determine when to respond and what to produce. We hope our work facilitates further progress in MLLMs.
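To make the idea of jointly scoring content and timing concrete, the following is a minimal illustrative sketch, not the evaluation framework described above. Every name in it (`Event`, `Response`, `content_match`, `timing_aware_score`), the symmetric `tolerance` window, and the substring check that stands in for the LLM judge are assumptions introduced for illustration only. It credits an annotated event when some response is both on-topic and close enough in time, and it walks responses in order so that each response is used at most once.

```python
from dataclasses import dataclass


@dataclass
class Event:
    time: float  # annotated timestamp, in seconds (hypothetical field)
    label: str   # reference content for this event (hypothetical field)


@dataclass
class Response:
    time: float  # time at which the model spoke, in seconds
    text: str    # what the model said


def content_match(reference: str, text: str) -> bool:
    # Stand-in for an LLM judge: a case-insensitive substring check.
    return reference.lower() in text.lower()


def timing_aware_score(events, responses, tolerance=2.0):
    """Fraction of events matched by a timely, on-topic response.

    Assumed rule: a response counts if it falls within `tolerance`
    seconds of the event and passes `content_match`. Responses are
    consumed in temporal order, so one response cannot match two events.
    """
    if not events:
        return 0.0
    ordered = sorted(responses, key=lambda r: r.time)
    next_idx = 0  # earliest response still available for matching
    hits = 0
    for ev in sorted(events, key=lambda e: e.time):
        for i in range(next_idx, len(ordered)):
            r = ordered[i]
            on_time = abs(r.time - ev.time) <= tolerance
            if on_time and content_match(ev.label, r.text):
                hits += 1
                next_idx = i + 1
                break
    return hits / len(events)


if __name__ == "__main__":
    events = [Event(5.0, "kettle"), Event(12.0, "door")]
    responses = [
        Response(5.5, "The kettle is boiling."),
        Response(20.0, "Someone is at the door."),  # correct but too late
    ]
    print(timing_aware_score(events, responses))
```

In this toy example the second response has the right content but arrives 8 seconds after the event, so it earns no credit. This mirrors the point that a correct answer at the wrong moment should not count as a success in a duplex setting.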