Omni-DuplexEval: Evaluating Real-Time Duplex Omni-Modal Interaction

Abstract

Real-time duplex interaction is essential for multimodal AI systems operating in real-world scenarios, where models must continuously process streaming inputs and respond at appropriate moments. However, most existing multimodal large language models (MLLMs) are evaluated in offline settings, where the entire video input is processed before any response is generated. Although recent work has begun to explore real-time duplex MLLMs, there is still no comprehensive benchmark or automatic evaluation method for this setting. To address this gap, we propose Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction. The benchmark consists of two complementary scenarios: (1) Real-Time Description, which evaluates the ability to generate continuous, time-aligned responses that track evolving multimodal inputs, and (2) Proactive Reminder, which evaluates the ability to identify salient events and respond at appropriate moments. Omni-DuplexEval contains 660 videos with fine-grained, human-annotated labels and precise temporal metadata, spanning 9 tasks grounded in real-world scenarios, with all questions formulated as open-ended queries. We further introduce an automatic evaluation framework based on LLM-as-a-Judge, which uses timestamp-aware, sequential reasoning to jointly evaluate how well response content aligns with the annotations and whether responses are well timed, and which achieves strong agreement with human judgments. Experiments on state-of-the-art duplex MLLMs reveal substantial limitations: the best-performing model achieves only 39.6% overall and only 20.0% on Proactive Reminder. Our analysis identifies two key challenges: models struggle to balance timely responses with coherent, holistic content generation, and they often fail to determine both when to respond and what to produce. We hope our work facilitates further progress in MLLMs.