X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Abstract

While streaming video understanding has made significant strides, real-world applications such as live sports broadcasting, autonomous driving, and multi-screen collaboration inherently demand continuous, multi-stream interaction. However, existing benchmarks are confined to single-stream paradigms, leaving a critical gap in evaluating online, cross-stream reasoning. To bridge this gap, we introduce X-Stream, the first benchmark dedicated to multi-stream streaming understanding. Comprising 4,220 rigorously curated QA pairs across 932 videos, X-Stream evaluates 11 subtasks spanning multi-window, multi-view, and multi-device scenarios. Crucially, the dataset is constructed with a novel dual-verification pipeline that prevents over-reliance on a single stream. Furthermore, we pioneer the conceptualization of multi-modal large language models (MLLMs) as naive multiplexers and systematically evaluate their performance through the lens of Signal Multiplexing Theory. Our extensive online inference experiments reveal a stark reality: state-of-the-art MLLMs struggle significantly with concurrent streams, scoring only about 50% and exhibiting poor proactive ability. Ultimately, X-Stream exposes the trade-offs in current multiplexing schemes, providing both a practical evaluation protocol and empirical guidance for next-generation multi-stream agents.