Slow-Fast Architecture for Video Multi-Modal Large Language Models
April 2, 2025
Authors: Min Shi, Shihao Wang, Chieh-Yun Chen, Jitesh Jain, Kai Wang, Junjun Xiong, Guilin Liu, Zhiding Yu, Humphrey Shi
cs.AI
Abstract
Balancing temporal resolution and spatial detail under a limited compute budget
remains a key challenge for video-based multi-modal large language models
(MLLMs). Existing methods typically compress video representations using
predefined rules before feeding them into the LLM, resulting in irreversible
information loss and often ignoring input instructions. To address this, we
propose a novel slow-fast architecture that naturally circumvents this
trade-off, enabling the use of more input frames while preserving spatial
details. Inspired by how humans first skim a video before focusing on relevant
parts, our slow-fast design employs a dual-token strategy: 1) "fast" visual
tokens -- a compact set of compressed video features -- are fed into the LLM
alongside text embeddings to provide a quick overview; 2) "slow" visual tokens
-- uncompressed video features -- are cross-attended by text embeddings through
specially designed hybrid decoder layers, enabling instruction-aware extraction
of relevant visual details with linear complexity. We conduct systematic
exploration to optimize both the overall architecture and key components.
Experiments show that our model significantly outperforms self-attention-only
baselines, extending the input capacity from 16 to 128 frames with just a 3%
increase in computation, and achieving a 16% average performance improvement
across five video understanding benchmarks. Our 7B model achieves
state-of-the-art performance among models of similar size. Furthermore, our
slow-fast architecture is a plug-and-play design that can be integrated into
other video MLLMs to improve efficiency and scalability.
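For concreteness, below is a minimal PyTorch sketch of the dual-token design as described in the abstract: compressed "fast" tokens are concatenated with the text embeddings and processed by ordinary causal self-attention, while the uncompressed "slow" tokens are only cross-attended as keys/values inside a hybrid decoder layer, so the extra cost grows linearly with their number. All module names, dimensions, and layer details here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the slow-fast dual-token idea (illustrative only; not the
# paper's released code). Shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn

class HybridDecoderLayer(nn.Module):
    """Decoder layer in which the LLM stream (fast visual tokens + text)
    additionally cross-attends to uncompressed "slow" visual tokens.
    The slow tokens appear only as keys/values, so the added cost is
    linear in the number of slow tokens."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, slow_tokens: torch.Tensor) -> torch.Tensor:
        # 1) Ordinary causal self-attention over the LLM sequence
        #    (compressed fast visual tokens + text embeddings).
        seq_len = hidden.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=hidden.device),
            diagonal=1,
        )
        h = self.norm1(hidden)
        hidden = hidden + self.self_attn(h, h, h, attn_mask=causal_mask)[0]

        # 2) Instruction-aware extraction: queries come from the LLM stream,
        #    keys/values from the uncompressed slow visual tokens.
        h = self.norm2(hidden)
        hidden = hidden + self.cross_attn(h, slow_tokens, slow_tokens)[0]

        # 3) Standard feed-forward block.
        return hidden + self.mlp(self.norm3(hidden))

# Toy usage: 128 frames kept uncompressed as slow tokens, while only a small
# pooled set of fast tokens enters the LLM sequence itself.
text = torch.randn(1, 64, 1024)          # instruction/text embeddings
fast = torch.randn(1, 16, 1024)          # compressed video overview ("fast" tokens)
slow = torch.randn(1, 128 * 49, 1024)    # e.g. 128 frames x 49 patches ("slow" tokens)

layer = HybridDecoderLayer()
out = layer(torch.cat([fast, text], dim=1), slow)
print(out.shape)  # torch.Size([1, 80, 1024])
```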