LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

Abstract

The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies in managing the explosion of visual-token context length. Existing strategies predominantly focus on "post-hoc" token reduction, that is, reducing visual tokens after feature extraction to alleviate the LLM's computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder. To address this, we introduce LiteFrame, a strong yet highly efficient video encoder backbone for Video LLMs. To train LiteFrame, we propose Compressed Token Distillation (CTD), a novel training framework that teaches a compact student vision encoder to directly predict the information-dense, spatio-temporally compressed representations produced by a large teacher vision model, effectively bypassing redundant computation. When coupled with further Language Model Adaptation (LMA), this approach yields a new latency-accuracy Pareto frontier: compared with InternVL3-8B, LiteFrame reduces end-to-end latency by 35% while processing 8× as many frames, and improves average video understanding accuracy across multiple benchmarks. Our results demonstrate a potential new path to unlocking longer-form video understanding under fixed compute budgets.
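
To make the idea behind CTD concrete, the following is a minimal sketch of a distillation target and loss. It is an illustration under stated assumptions, not the training recipe itself: the abstract does not specify how the teacher's tokens are compressed or which objective is used, so mean pooling over spatio-temporal windows and a mean-squared-error loss are stand-ins chosen for simplicity. The function names `compress_teacher_tokens` and `distillation_loss` and the parameters `t_win` and `s_win` are hypothetical.

```python
from typing import List

Vector = List[float]   # one token embedding
Frame = List[Vector]   # one frame = a list of patch tokens


def compress_teacher_tokens(frames: List[Frame], t_win: int, s_win: int) -> List[Vector]:
    """Pool teacher tokens over windows of t_win frames and s_win patches.

    Assumption: mean pooling stands in for whatever spatio-temporal
    compression the teacher pipeline actually applies.
    """
    n_frames, n_patches, dim = len(frames), len(frames[0]), len(frames[0][0])
    if n_frames % t_win or n_patches % s_win:
        raise ValueError("window sizes must divide the frame and patch counts")
    compressed: List[Vector] = []
    for t0 in range(0, n_frames, t_win):
        for p0 in range(0, n_patches, s_win):
            # Gather every token inside the current spatio-temporal window.
            window = [frames[t][p]
                      for t in range(t0, t0 + t_win)
                      for p in range(p0, p0 + s_win)]
            # Average the window down to a single compressed token.
            compressed.append(
                [sum(tok[d] for tok in window) / len(window) for d in range(dim)]
            )
    return compressed


def distillation_loss(student_tokens: List[Vector], target_tokens: List[Vector]) -> float:
    """Mean squared error between student predictions and compressed targets.

    Assumption: MSE is an illustrative objective, not a confirmed one.
    """
    if len(student_tokens) != len(target_tokens):
        raise ValueError("student must predict one token per compressed target")
    total, count = 0.0, 0
    for s_tok, t_tok in zip(student_tokens, target_tokens):
        for s_val, t_val in zip(s_tok, t_tok):
            total += (s_val - t_val) ** 2
            count += 1
    return total / count
```

In this sketch the student is never asked to reproduce the teacher's full per-frame token grid. It is scored only against the already-compressed tokens, so with `t_win=2` and `s_win=2` a clip of 4 frames with 4 patches each (16 teacher tokens) yields just 4 targets for the student to predict.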