Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

Abstract

Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception and lack a synchronized logical reasoning stream, yet directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. VST supports a thinking-while-watching mechanism that activates reasoning over incoming video clips during streaming. By amortizing LLM reasoning latency over video playback, this design improves timely comprehension and coherent cognition while preserving real-time responsiveness. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts an offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, together with an entity-relation grounded streaming Chain-of-Thought that enforces multi-evidence reasoning and sustained attention to the video stream. Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g., 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form and reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves a +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at https://github.com/1ranGuan/VST.
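The latency argument in the abstract (reasoning cost amortized over video playback) can be illustrated with a toy calculation. The sketch below is not the VST implementation: the `LatencyModel` class, both function names, and all timing values are hypothetical, chosen only to show why reasoning during playback leaves less work pending when a question arrives. It assumes the question arrives immediately after the last clip finishes playing.

```python
from dataclasses import dataclass


@dataclass
class LatencyModel:
    """Hypothetical per-step costs in seconds (illustrative, not from the paper)."""
    clip_duration: float = 2.0   # playback time of one incoming clip
    think_per_clip: float = 1.5  # time to reason over one clip
    answer_time: float = 0.5     # time to decode the final answer


def latency_think_after_query(num_clips: int, m: LatencyModel) -> float:
    """All reasoning is deferred until the question arrives."""
    return num_clips * m.think_per_clip + m.answer_time


def latency_think_while_watching(num_clips: int, m: LatencyModel) -> float:
    """Reasoning over earlier clips runs while later clips are playing."""
    backlog = 0.0  # unfinished reasoning work, in seconds
    for _ in range(num_clips):
        # While the current clip plays, work off reasoning queued earlier.
        backlog = max(0.0, backlog - m.clip_duration)
        # The clip has now been fully received, so its reasoning is queued.
        backlog += m.think_per_clip
    # At query time only the remaining backlog and the answer are left.
    return backlog + m.answer_time


if __name__ == "__main__":
    model = LatencyModel()
    print(latency_think_after_query(10, model))
    print(latency_think_while_watching(10, model))
```

With these assumed values and ten clips, the deferred strategy takes 15.5 seconds to respond and the interleaved one takes 2.0 seconds. The gap grows with the number of clips as long as per-clip reasoning fits within one clip's playback time. If reasoning is slower than playback, the backlog accumulates and the advantage shrinks.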
