InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

December 9, 2025
Authors: Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang
cs.AI

Abstract

Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when the sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. To achieve competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving a long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
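The constant-latency, constant-memory claim rests on the linear-attention half of the hybrid: Gated DeltaNet replaces the ever-growing KV cache with a fixed-size matrix state that is decayed and rewritten each token. Below is a minimal, unoptimized single-head PyTorch sketch of the gated delta-rule recurrence, assuming per-token scalar gates `alpha` (decay) and `beta` (write strength); the function name and shapes are illustrative, and InfiniteVL's actual kernels (chunked, multi-head, interleaved with SWA layers) in the linked repository will differ in detail.

```python
import torch
import torch.nn.functional as F

def gated_delta_rule(q, k, v, alpha, beta):
    """Recurrent sketch of the gated delta rule (single head):
        S_t = alpha_t * (S_{t-1} - beta_t * (S_{t-1} k_t) k_t^T) + beta_t * v_t k_t^T
        o_t = S_t q_t
    The state S is a fixed (d x d) matrix, so memory does not grow with sequence length.
    q, k, v: (B, T, d); alpha, beta: (B, T), values in (0, 1).
    """
    B, T, d = q.shape
    S = q.new_zeros(B, d, d)            # constant-size state in place of a KV cache
    outputs = []
    for t in range(T):
        kt = k[:, t].unsqueeze(-1)      # (B, d, 1)
        vt = v[:, t].unsqueeze(-1)      # (B, d, 1)
        a = alpha[:, t].view(B, 1, 1)   # decay gate: forget old memory
        b = beta[:, t].view(B, 1, 1)    # write gate: strength of the new association
        # erase the value currently stored under key k_t, then write v_t k_t^T
        S = a * (S - b * (S @ kt) @ kt.transpose(1, 2)) + b * (vt @ kt.transpose(1, 2))
        outputs.append((S @ q[:, t].unsqueeze(-1)).squeeze(-1))
    return torch.stack(outputs, dim=1)  # (B, T, d)

# Usage: per-token cost is independent of T, unlike softmax attention.
B, T, d = 2, 16, 32
q, v = torch.randn(B, T, d), torch.randn(B, T, d)
k = F.normalize(torch.randn(B, T, d), dim=-1)   # keys are typically unit-norm
alpha, beta = torch.rand(B, T), torch.rand(B, T)
out = gated_delta_rule(q, k, v, alpha, beta)    # (B, T, d)
```

Per the abstract, these linear-attention layers are interleaved with sliding-window attention, so exact attention inside the window handles local, information-dense content (OCR, documents) while the delta-rule state carries long-range memory at constant cost.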