QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
May 22, 2025
Authors: Benjamin Schneider, Dongfu Jiang, Chao Du, Tianyu Pang, Wenhu Chen
cs.AI
Abstract
Long-video understanding has emerged as a crucial capability in real-world
applications such as video surveillance, meeting summarization, educational
lecture analysis, and sports broadcasting. However, it remains computationally
prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential
video decoding, where converting the raw bitstream to RGB frames can
take up to a minute for hour-long video inputs, and 2) costly prefilling of up
to several million tokens for LLM inference, resulting in high latency and
memory use. To address these challenges, we propose QuickVideo, a
system-algorithm co-design that substantially accelerates long-video
understanding to support real-time downstream applications. It comprises three
key innovations: QuickDecoder, a parallelized CPU-based video decoder that
achieves 2-3 times speedup by splitting videos into keyframe-aligned intervals
processed concurrently; QuickPrefill, a memory-efficient prefilling method
using KV-cache pruning to support more frames with less GPU memory; and an
overlapping scheme that runs CPU video decoding concurrently with GPU inference.
Together, these components reduce inference time by a minute on long video
inputs, enabling scalable, high-quality video understanding even on limited
hardware. Experiments show that QuickVideo generalizes across durations and
sampling rates, making long video processing feasible in practice.
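The keyframe-aligned splitting attributed to QuickDecoder can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the even-split heuristic, and the stub decoder (which returns frame indices rather than RGB frames) are all assumptions made for illustration. The point it shows is that an interval starting on a keyframe can be decoded without any earlier frames, so intervals can be handed to separate workers and concatenated in order.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor


def keyframe_aligned_intervals(keyframes, total_frames, num_workers):
    """Split [0, total_frames) into at most num_workers contiguous intervals.

    Every interval starts on a keyframe index, so it can be decoded
    independently. `keyframes` is assumed sorted and to start at 0.
    """
    if not keyframes or keyframes[0] != 0:
        raise ValueError("expected sorted keyframe indices starting at 0")
    starts = []
    for w in range(num_workers):
        target = w * total_frames // num_workers  # ideal even split point
        i = bisect_left(keyframes, target)  # first keyframe at or after target
        if i == len(keyframes):
            break  # no keyframe left; the last interval absorbs the tail
        if not starts or keyframes[i] > starts[-1]:
            starts.append(keyframes[i])
    ends = starts[1:] + [total_frames]
    return list(zip(starts, ends))


def decode_interval(start, end):
    """Hypothetical stand-in for a real decoder: yields frame indices only."""
    return list(range(start, end))


def parallel_decode(keyframes, total_frames, num_workers):
    """Decode all intervals concurrently and return frames in display order."""
    intervals = keyframe_aligned_intervals(keyframes, total_frames, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() preserves input order, so concatenation restores frame order.
        chunks = pool.map(lambda iv: decode_interval(*iv), intervals)
        return [frame for chunk in chunks for frame in chunk]


if __name__ == "__main__":
    kf = [0, 250, 500, 750, 1000, 1250]
    print(keyframe_aligned_intervals(kf, 1500, 3))
```

With sparse keyframes the intervals become uneven, and fewer intervals than workers may be produced, which bounds the achievable speedup. A real decoder would also need to run in processes or in a library that releases the GIL for the threads to give any actual parallelism; the thread pool here only demonstrates the control flow.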