Accelerating Streaming Video Large Language Models via Hierarchical Token Compression

November 30, 2025
Authors: Yiyu Wang, Xuyang Liu, Xiyan Gui, Xinying Lin, Boxue Yang, Chenfei Liao, Tailai Chen, Linfeng Zhang
cs.AI

Abstract

Streaming Video Large Language Models (VideoLLMs) have demonstrated impressive performance across various video understanding tasks, but they face significant challenges in real-time deployment due to the high computational cost of processing dense visual tokens from continuous video streams. In streaming video scenarios, the primary bottleneck lies in the Vision Transformer (ViT) encoding stage, where redundant processing of temporally similar frames leads to inefficiency. Additionally, inflated token sequences during LLM pre-filling further exacerbate latency and memory overhead. To address these challenges, we propose Streaming Token Compression (STC), a plug-and-play hierarchical framework that seamlessly integrates into existing streaming VideoLLMs, optimizing both the ViT encoding and LLM pre-filling stages to accelerate processing. STC introduces two token-level accelerators: STC-Cacher, which reduces ViT encoding overhead by caching and reusing features from temporally similar frames, and STC-Pruner, which compresses the visual token sequence before it enters the LLM, preserving only the most salient tokens based on both spatial and temporal relevance. Extensive experiments on four baseline streaming VideoLLMs across five benchmarks demonstrate that STC outperforms other compression methods. Notably, STC retains up to 99% of accuracy on the ReKV framework while reducing ViT encoding latency and LLM pre-filling latency by 24.5% and 45.3%, respectively.
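The two accelerators lend themselves to a simple mental model: skip ViT encoding when a frame closely matches the previous one, and drop low-saliency tokens before LLM pre-filling. The PyTorch sketch below only illustrates this idea under assumed interfaces; the `vit_encoder` callable, the cosine-similarity threshold, and the norm-based saliency score are placeholders, not the paper's actual STC-Cacher and STC-Pruner formulations.

```python
# Illustrative sketch (not the authors' code) of the two-stage STC idea.
import torch
import torch.nn.functional as F


class FrameFeatureCache:
    """Reuse cached features for temporally similar frames (the role of STC-Cacher)."""

    def __init__(self, vit_encoder, sim_threshold: float = 0.95):
        self.vit = vit_encoder            # assumed callable: (3, H, W) frame -> (N, D) tokens
        self.sim_threshold = sim_threshold  # assumed similarity cutoff
        self.prev_frame = None
        self.prev_tokens = None

    def encode(self, frame: torch.Tensor) -> torch.Tensor:
        if self.prev_frame is not None:
            # Cosine similarity between flattened consecutive frames.
            sim = F.cosine_similarity(frame.flatten(), self.prev_frame.flatten(), dim=0)
            if sim > self.sim_threshold:
                return self.prev_tokens   # reuse cached features; skip the ViT forward pass
        tokens = self.vit(frame)
        self.prev_frame, self.prev_tokens = frame, tokens
        return tokens


def prune_tokens(tokens: torch.Tensor, prev_tokens: torch.Tensor, keep_ratio: float = 0.5):
    """Keep only the most salient tokens before LLM pre-filling (the role of STC-Pruner).

    The saliency score here is a stand-in that mixes spatial magnitude with temporal
    change relative to the previous frame's tokens.
    """
    spatial_score = tokens.norm(dim=-1)                    # (N,) per-token magnitude
    temporal_score = (tokens - prev_tokens).norm(dim=-1)   # (N,) change vs. previous frame
    saliency = spatial_score + temporal_score
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep_idx = saliency.topk(k).indices.sort().values      # keep top-k, preserve original order
    return tokens[keep_idx]
```

Because both steps operate purely on frames and token tensors, a wrapper like this can sit in front of an existing streaming VideoLLM without modifying the ViT or the LLM themselves, which is the plug-and-play property the abstract emphasizes.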