InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
December 9, 2025
Authors: Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang
cs.AI
Abstract
Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when the sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. To achieve competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves more than a 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving its long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
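The architectural idea in the abstract is to pair a local sparse-attention primitive (sliding window attention) with a linear-complexity recurrent primitive (a Gated DeltaNet-style delta-rule memory). The sketch below is illustrative only and is not the released InfiniteVL implementation: the function names `sliding_window_attention` and `gated_delta_rule`, the tensor shapes, the gate parameterization, and the naive per-timestep loop are all assumptions made for clarity, not the optimized kernels described in the paper.

```python
# Minimal sketch (assumed, simplified) of the two attention primitives the
# abstract combines: band-masked sliding-window attention and a gated
# delta-rule linear-attention recurrence. Not the authors' implementation.
import torch
import torch.nn.functional as F


def sliding_window_attention(q, k, v, window: int):
    """Causal attention restricted to the most recent `window` positions.
    q, k, v: (batch, heads, seq, dim)."""
    seq = q.shape[-2]
    idx = torch.arange(seq, device=q.device)
    # position j is visible from position i iff i - window < j <= i
    mask = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)


def gated_delta_rule(q, k, v, alpha, beta):
    """Recurrent form of a gated delta-rule update (Gated DeltaNet-style):
    the memory S is decayed by alpha, corrected toward v_t along k_t with
    step size beta, and read out with q_t.
    q, k: (batch, seq, d_k); v: (batch, seq, d_v); alpha, beta: (batch, seq)."""
    b, seq, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(b, d_k, d_v, dtype=q.dtype, device=q.device)
    outs = []
    for t in range(seq):
        k_t = k[:, t]                            # (b, d_k)
        v_t = v[:, t]                            # (b, d_v)
        a_t = alpha[:, t, None, None]            # gated decay of the memory
        b_t = beta[:, t, None, None]             # delta-rule learning rate
        S = a_t * S                              # decay old associations
        pred = torch.einsum("bk,bkv->bv", k_t, S)  # memory's prediction for k_t
        # delta-rule correction: move the prediction for k_t toward v_t
        S = S + b_t * torch.einsum("bk,bv->bkv", k_t, v_t - pred)
        outs.append(torch.einsum("bk,bkv->bv", q[:, t], S))
    return torch.stack(outs, dim=1)              # (b, seq, d_v)


if __name__ == "__main__":
    b, h, seq, d = 1, 2, 16, 32
    q = torch.randn(b, h, seq, d)
    k = torch.randn(b, h, seq, d)
    v = torch.randn(b, h, seq, d)
    print(sliding_window_attention(q, k, v, window=8).shape)   # (1, 2, 16, 32)

    ql, kl, vl = (torch.randn(b, seq, d) for _ in range(3))
    alpha = torch.sigmoid(torch.randn(b, seq))
    beta = torch.sigmoid(torch.randn(b, seq))
    print(gated_delta_rule(ql, kl, vl, alpha, beta).shape)     # (1, 16, 32)
```

Because the windowed branch touches only a fixed number of keys and the delta-rule branch carries a fixed-size state, both primitives keep per-token cost and memory constant as the sequence grows, which is the property behind the constant-latency and constant-footprint claims in the abstract.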