

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

May 1, 2026
作者: Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Zefeng He, Muxin Fu, Daizong Liu, Wei-Long Zheng, Yu Cheng
cs.AI

Abstract

While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM can resist length-induced signal decay and accelerate internal prediction convergence.
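The two mechanisms the abstract names can be sketched numerically: with roughly comparable logits, the softmax partition function grows with the text history, so the share of attention landing on a fixed pool of visual tokens decays like n_visual / (n_visual + n_text); a retrieval branch that attends only to the visual embeddings is, by construction, independent of that history length. The function names below (`visual_attention_mass`, `pvm_branch`) are illustrative, not the paper's implementation, and the uniform-logit assumption is a simplification.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)  # the partition function the abstract refers to
    return [e / z for e in exps]

def visual_attention_mass(n_visual, n_text):
    """Fraction of attention mass on visual tokens, assuming all logits
    are comparable (here: uniform). As the text history grows, the
    partition function expands and the visual share decays as
    n_visual / (n_visual + n_text) -- the "Visual Signal Dilution"."""
    attn = softmax([0.0] * (n_visual + n_text))
    return sum(attn[:n_visual])

# Visual share at 256 image tokens as generated text accumulates:
masses = [round(visual_attention_mass(256, t), 3) for t in (256, 1024, 4096)]
# → [0.5, 0.2, 0.059]

def pvm_branch(query, visual_embeds):
    """A distance-agnostic retrieval path (illustrative): the query
    attends only to the fixed visual embeddings, so the result does not
    depend on how long the text history has grown. In the paper's design
    this runs as a parallel branch alongside the FFN; here we just show
    the retrieval step: softmax(q · V) weighted sum of visual vectors."""
    logits = [sum(q * v for q, v in zip(query, vec)) for vec in visual_embeds]
    weights = softmax(logits)
    dim = len(query)
    return [sum(w * vec[d] for w, vec in zip(weights, visual_embeds))
            for d in range(dim)]
```

A layer output in this style would then be something like `h + ffn(h) + pvm_branch(h, visual_embeds)`: the retrieval term supplies visual signal at full strength regardless of generation depth, which is the structural mitigation the abstract describes.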
PDF · May 6, 2026