

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

May 1, 2026
作者: Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Zefeng He, Muxin Fu, Daizong Liu, Wei-Long Zheng, Yu Cheng
cs.AI

Abstract

While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM can resist length-induced signal decay and accelerate internal prediction convergence.