InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

Abstract

Window attention and linear attention are two principal strategies for mitigating the quadratic complexity and ever-growing KV cache of Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when the sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. To achieve competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving its long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
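The two mechanisms named in the abstract can be illustrated with a minimal pure-Python sketch. It is not taken from the InfiniteVL code base: the function names (`sliding_window_mask`, `gated_delta_step`, `read_memory`) are hypothetical, and the update rule is one common formulation of a gated delta rule, assumed here for illustration only. How InfiniteVL actually interleaves or parameterizes these layers is not specified in the abstract.

```python
from collections import deque


def sliding_window_mask(seq_len, window):
    """Causal sliding-window mask: query i may attend to key j iff i - window < j <= i."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


def gated_delta_step(state, k, v, alpha, beta):
    """One recurrent update of a fixed-size (d_v x d_k) memory matrix.

    Assumed rule (illustrative): S_new = alpha * (S - beta * (S k) k^T) + beta * v k^T
    alpha is a decay gate, beta a write strength, both scalars in [0, 1].
    """
    d_k = len(k)
    s_k = [sum(row[j] * k[j] for j in range(d_k)) for row in state]  # S k
    return [
        [alpha * (state[i][j] - beta * s_k[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(len(v))
    ]


def read_memory(state, q):
    """Query the memory: o = S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in state]


# Window branch: a bounded cache holds at most `window` recent entries.
window = 2
kv_cache = deque(maxlen=window)
for token_id in range(1000):
    kv_cache.append(token_id)  # stand-in for a (key, value) pair

# Linear branch: the state stays 2 x 2 regardless of how many tokens are written.
state = [[0.0, 0.0], [0.0, 0.0]]
state = gated_delta_step(state, k=[1.0, 0.0], v=[3.0, 4.0], alpha=1.0, beta=1.0)
state = gated_delta_step(state, k=[1.0, 0.0], v=[5.0, 6.0], alpha=1.0, beta=1.0)
```

In this toy setting, after 1000 appended tokens the window cache still holds only the last two entries, and the second write to the same unit-norm key overwrites the first, so `read_memory(state, [1.0, 0.0])` returns `[5.0, 6.0]`. Both branches use memory that is independent of sequence length, which is the property behind the constant latency and memory footprint claimed above.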
