ViSpeak: Visual Instruction Feedback in Streaming Videos

Abstract

Recent advances in Large Multi-modal Models (LMMs) have focused primarily on offline video understanding. In contrast, streaming video understanding remains a major challenge for recent models because it is time-sensitive, omni-modal, and interactive. In this work, we extend streaming video understanding from a new perspective and propose a novel task, Visual Instruction Feedback, in which models must be aware of visual content and learn to extract instructions from it. For example, when a user waves at an agent, the agent should recognize the gesture and open the conversation with a greeting. Following instructions given in the visual modality therefore greatly enhances user-agent interaction. To facilitate research, we define seven key subtasks closely tied to the visual modality and collect the ViSpeak-Instruct dataset for training and ViSpeak-Bench for evaluation. We further propose the ViSpeak model, a state-of-the-art streaming video understanding LMM with GPT-4o-level performance on various streaming video understanding benchmarks. After fine-tuning on ViSpeak-Instruct, ViSpeak acquires a basic visual instruction feedback ability and serves as a solid baseline for future research.