QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

May 22, 2025
Authors: Benjamin Schneider, Dongfu Jiang, Chao Du, Tianyu Pang, Wenhu Chen
cs.AI

Abstract

Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, the process of converting the raw bit stream to RGB frames, which can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves a 2-3x speedup by splitting videos into keyframe-aligned intervals that are processed concurrently; QuickPrefill, a memory-efficient prefilling method that uses KV-cache pruning to support more frames with less GPU memory; and a scheme that overlaps CPU video decoding with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.
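
The abstract describes QuickDecoder only at a high level, so the following is a minimal sketch of the general idea of keyframe-aligned parallel decoding, not the authors' implementation. The function names (`keyframe_aligned_intervals`, `decode_interval`, `parallel_decode`) are hypothetical, the keyframe positions are assumed to be known in advance and sorted, and the "decoder" is a stand-in that simply returns frame indices. The point it illustrates is that each interval starts at a keyframe, so a worker can decode its interval without depending on frames owned by another worker.

```python
from concurrent.futures import ThreadPoolExecutor


def keyframe_aligned_intervals(keyframes, total_frames, num_workers):
    """Split [0, total_frames) into at most num_workers half-open intervals.

    Every interval starts at a keyframe index. `keyframes` is assumed to be
    a sorted list of frame indices whose first entry is 0.
    """
    if not keyframes or keyframes[0] != 0:
        raise ValueError("the first frame must be a keyframe")
    n = min(num_workers, len(keyframes))
    # Pick n keyframes spread roughly evenly across the keyframe list.
    starts = [keyframes[i * len(keyframes) // n] for i in range(n)]
    ends = starts[1:] + [total_frames]
    return list(zip(starts, ends))


def decode_interval(interval):
    """Stand-in for a real decoder (assumption): returns the frame indices."""
    start, end = interval
    return list(range(start, end))


def parallel_decode(keyframes, total_frames, num_workers):
    """Decode all intervals concurrently and concatenate results in order."""
    intervals = keyframe_aligned_intervals(keyframes, total_frames, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        chunks = list(pool.map(decode_interval, intervals))
    return [frame for chunk in chunks for frame in chunk]


if __name__ == "__main__":
    # Hypothetical video: 120 frames with a keyframe every 30 frames.
    print(keyframe_aligned_intervals([0, 30, 60, 90], 120, 2))
```

In a real system the stand-in decoder would seek to the interval's first keyframe and decode forward to the interval's end; whether threads or processes are the better fit depends on whether the underlying decoding library releases Python's global interpreter lock.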
