InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

December 9, 2025
Authors: Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang
cs.AI

Abstract

Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when the sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. To achieve competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving a long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
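The constant-latency, constant-memory claim rests on the linear-attention half of the hybrid: Gated DeltaNet replaces the ever-growing KV cache with a fixed-size matrix state that is decayed and rewritten each token. Below is a minimal, unoptimized single-head PyTorch sketch of the gated delta-rule recurrence, assuming per-token scalar gates `alpha` (decay) and `beta` (write strength); the function name and shapes are illustrative, and InfiniteVL's actual kernels (chunked, multi-head, interleaved with SWA layers) in the linked repository will differ in detail.

```python
import torch
import torch.nn.functional as F

def gated_delta_rule(q, k, v, alpha, beta):
    """Recurrent sketch of the gated delta rule (single head):
        S_t = alpha_t * (S_{t-1} - beta_t * (S_{t-1} k_t) k_t^T) + beta_t * v_t k_t^T
        o_t = S_t q_t
    The state S is a fixed (d x d) matrix, so memory does not grow with sequence length.
    q, k, v: (B, T, d); alpha, beta: (B, T), values in (0, 1).
    """
    B, T, d = q.shape
    S = q.new_zeros(B, d, d)            # constant-size state in place of a KV cache
    outputs = []
    for t in range(T):
        kt = k[:, t].unsqueeze(-1)      # (B, d, 1)
        vt = v[:, t].unsqueeze(-1)      # (B, d, 1)
        a = alpha[:, t].view(B, 1, 1)   # decay gate: forget old memory
        b = beta[:, t].view(B, 1, 1)    # write gate: strength of the new association
        # erase the value currently stored under key k_t, then write v_t k_t^T
        S = a * (S - b * (S @ kt) @ kt.transpose(1, 2)) + b * (vt @ kt.transpose(1, 2))
        outputs.append((S @ q[:, t].unsqueeze(-1)).squeeze(-1))
    return torch.stack(outputs, dim=1)  # (B, T, d)

# Usage: per-token cost is independent of T, unlike softmax attention.
B, T, d = 2, 16, 32
q, v = torch.randn(B, T, d), torch.randn(B, T, d)
k = F.normalize(torch.randn(B, T, d), dim=-1)   # keys are typically unit-norm
alpha, beta = torch.rand(B, T), torch.rand(B, T)
out = gated_delta_rule(q, k, v, alpha, beta)    # (B, T, d)
```

Per the abstract, these linear-attention layers are interleaved with sliding-window attention, so exact attention inside the window handles local, information-dense content (OCR, documents) while the delta-rule state carries long-range memory at constant cost.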