LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video Large Language Models

Abstract

The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies in managing the explosion of visual-token context length. Existing strategies predominantly focus on "post-hoc" token reduction, that is, reducing visual tokens after feature extraction to alleviate the LLM's computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder. To address this, we introduce LiteFrame, a strong yet highly efficient video encoder backbone for Video LLMs. To train LiteFrame, we propose Compressed Token Distillation (CTD), a novel training framework that teaches a compact student vision encoder to directly predict the information-dense, spatio-temporally compressed representations produced by a large teacher vision model, effectively bypassing redundant computation. When coupled with further Language Model Adaptation (LMA), this approach yields a new latency-accuracy Pareto frontier: compared with InternVL3-8B, LiteFrame reduces end-to-end latency by 35% while processing 8× more frames, and it improves average video understanding accuracy across multiple benchmarks. Our results point to a new path toward longer-form video understanding under fixed compute budgets.
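
To make the idea behind CTD concrete, the following is a minimal sketch, not the authors' implementation. It assumes the teacher's token grid is compressed by plain mean pooling over time and over flattened spatial position, and that the student is trained with a mean-squared-error loss against those compressed tokens. The abstract specifies neither choice, and the helper names `compress_tokens` and `distillation_loss` are hypothetical.

```python
from statistics import fmean


def compress_tokens(tokens, t_stride, s_stride):
    """Average-pool a [T][N][D] token grid over time and flattened space.

    Assumption: mean pooling stands in for the teacher-side compressor,
    which the abstract does not specify.
    """
    T, N, D = len(tokens), len(tokens[0]), len(tokens[0][0])
    compressed = []
    for t0 in range(0, T, t_stride):
        for n0 in range(0, N, s_stride):
            # Collect every token vector in this spatio-temporal window.
            group = [
                tokens[t][n]
                for t in range(t0, min(t0 + t_stride, T))
                for n in range(n0, min(n0 + s_stride, N))
            ]
            compressed.append([fmean(vec[d] for vec in group) for d in range(D)])
    return compressed  # flat list of compressed tokens, each of length D


def distillation_loss(student_tokens, target_tokens):
    """Mean squared error between student outputs and compressed teacher tokens.

    Assumption: MSE is an illustrative choice of distillation objective.
    """
    assert len(student_tokens) == len(target_tokens)
    squared_errors = [
        (s - t) ** 2
        for s_vec, t_vec in zip(student_tokens, target_tokens)
        for s, t in zip(s_vec, t_vec)
    ]
    return fmean(squared_errors)


# Toy teacher output: 4 frames x 4 tokens per frame x 2 feature dimensions.
teacher = [[[float(t + n), float(t * n)] for n in range(4)] for t in range(4)]
target = compress_tokens(teacher, t_stride=2, s_stride=2)

# A stand-in "student" that is off by exactly 1.0 in every dimension.
student = [[x + 1.0 for x in vec] for vec in target]
loss = distillation_loss(student, target)
```

In this toy setup the 16 teacher tokens collapse into 4 targets, and the student is asked to emit those 4 tokens directly. It never produces the full per-frame grid, which is the sense in which the redundant computation is bypassed.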