InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
December 9, 2025
Authors: Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang
cs.AI
Abstract
Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when the sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. To achieve competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves more than a 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving its long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
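The architectural idea in the abstract is to pair a local sparse-attention primitive (sliding window attention) with a linear-complexity recurrent primitive (a Gated DeltaNet-style delta-rule memory). The sketch below is illustrative only and is not the released InfiniteVL implementation: the function names `sliding_window_attention` and `gated_delta_rule`, the tensor shapes, the gate parameterization, and the naive per-timestep loop are all assumptions made for clarity, not the optimized kernels described in the paper.

```python
# Minimal sketch (assumed, simplified) of the two attention primitives the
# abstract combines: band-masked sliding-window attention and a gated
# delta-rule linear-attention recurrence. Not the authors' implementation.
import torch
import torch.nn.functional as F


def sliding_window_attention(q, k, v, window: int):
    """Causal attention restricted to the most recent `window` positions.
    q, k, v: (batch, heads, seq, dim)."""
    seq = q.shape[-2]
    idx = torch.arange(seq, device=q.device)
    # position j is visible from position i iff i - window < j <= i
    mask = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)


def gated_delta_rule(q, k, v, alpha, beta):
    """Recurrent form of a gated delta-rule update (Gated DeltaNet-style):
    the memory S is decayed by alpha, corrected toward v_t along k_t with
    step size beta, and read out with q_t.
    q, k: (batch, seq, d_k); v: (batch, seq, d_v); alpha, beta: (batch, seq)."""
    b, seq, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(b, d_k, d_v, dtype=q.dtype, device=q.device)
    outs = []
    for t in range(seq):
        k_t = k[:, t]                            # (b, d_k)
        v_t = v[:, t]                            # (b, d_v)
        a_t = alpha[:, t, None, None]            # gated decay of the memory
        b_t = beta[:, t, None, None]             # delta-rule learning rate
        S = a_t * S                              # decay old associations
        pred = torch.einsum("bk,bkv->bv", k_t, S)  # memory's prediction for k_t
        # delta-rule correction: move the prediction for k_t toward v_t
        S = S + b_t * torch.einsum("bk,bv->bkv", k_t, v_t - pred)
        outs.append(torch.einsum("bk,bkv->bv", q[:, t], S))
    return torch.stack(outs, dim=1)              # (b, seq, d_v)


if __name__ == "__main__":
    b, h, seq, d = 1, 2, 16, 32
    q = torch.randn(b, h, seq, d)
    k = torch.randn(b, h, seq, d)
    v = torch.randn(b, h, seq, d)
    print(sliding_window_attention(q, k, v, window=8).shape)   # (1, 2, 16, 32)

    ql, kl, vl = (torch.randn(b, seq, d) for _ in range(3))
    alpha = torch.sigmoid(torch.randn(b, seq))
    beta = torch.sigmoid(torch.randn(b, seq))
    print(gated_delta_rule(ql, kl, vl, alpha, beta).shape)     # (1, 16, 32)
```

Because the windowed branch touches only a fixed number of keys and the delta-rule branch carries a fixed-size state, both primitives keep per-token cost and memory constant as the sequence grows, which is the property behind the constant-latency and constant-footprint claims in the abstract.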