Omni-DuplexEval: Evaluating Real-Time Duplex Omni-Modal Interaction

Abstract

Real-time duplex interaction is essential for multimodal AI systems that operate in real-world scenarios, where a model must continuously process streaming inputs and respond at appropriate moments. However, most existing multimodal large language models (MLLMs) are evaluated in offline settings, where the entire video input is processed before any response is generated. Although recent work has begun to explore real-time duplex MLLMs, there is still no comprehensive benchmark or automatic evaluation method for this setting. To address this gap, we propose Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction. The benchmark consists of two complementary scenarios: (1) Real-Time Description, which evaluates the ability to generate continuous, time-aligned responses that track evolving multimodal inputs, and (2) Proactive Reminder, which evaluates the ability to identify salient events and respond at appropriate moments. Omni-DuplexEval contains 660 videos with fine-grained human-annotated labels and precise temporal metadata, spanning 9 tasks grounded in real-world scenarios; all questions are open-ended. We further introduce an automatic evaluation framework based on LLM-as-a-Judge, which uses timestamp-aware, sequential reasoning to jointly assess the alignment of response content and the timing of responses, and which achieves strong agreement with human judgments. Experiments on state-of-the-art duplex MLLMs reveal substantial limitations: the best-performing model achieves only 39.6% overall and only 20.0% on Proactive Reminder. Our analysis identifies two key challenges. First, models struggle to balance timely responses against coherent, holistic content generation. Second, they often fail to determine both when to respond and what to produce. We hope this work facilitates further progress in MLLMs.