Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

Abstract

Autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, yet they suffer from a "Visual Signal Dilution" phenomenon: as textual history accumulates, the attention partition function grows, and visual attention decays inversely with the generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM delivers consistent average accuracy gains at both the 4B and 8B scales with negligible parameter overhead, particularly on complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM can resist length-induced signal decay and accelerate internal prediction convergence.
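
The dilution effect described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's method or analysis. It assumes a single softmax attention row in which all visual keys share one logit and all text keys share another, and the function name and parameters are hypothetical. Under these assumptions, adding text tokens enlarges the softmax denominator (the partition function), so the share of attention on a fixed set of visual tokens shrinks.

```python
import numpy as np


def visual_attention_mass(num_visual, num_text, visual_logit=0.0, text_logit=0.0):
    """Fraction of softmax attention that lands on the visual tokens.

    Simplifying assumption (for illustration only): every visual key has the
    same logit and every text key has the same logit.
    """
    logits = np.concatenate([
        np.full(num_visual, visual_logit, dtype=np.float64),
        np.full(num_text, text_logit, dtype=np.float64),
    ])
    # Numerically stable softmax; the sum below is the partition function.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return float(weights[:num_visual].sum())


if __name__ == "__main__":
    # Hold the number of visual tokens fixed and grow the text history.
    for num_text in (0, 256, 1024, 4096):
        mass = visual_attention_mass(num_visual=256, num_text=num_text)
        print(f"text tokens={num_text:5d}  visual attention mass={mass:.4f}")
```

With equal logits the visual share reduces to V / (V + T) for V visual and T text tokens, so it falls roughly in inverse proportion to sequence length once T dominates. Real attention logits are not uniform, so this shows only the direction of the effect, not its measured size.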

