Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

Abstract

Autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, but they suffer from "Visual Signal Dilution": as textual history accumulates, the attention partition function expands, and visual attention decays inversely with the generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models show that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains at both the 4B and 8B scales, particularly on complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis shows that PVM can resist length-induced signal decay and accelerate internal prediction convergence.
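The dilution effect described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's analysis; it assumes, purely for illustration, that every visual key receives one shared attention logit and every text key another, so that the softmax mass on the visual tokens has a closed form. The function name `visual_attention_mass` and the token count `NUM_VISUAL` are hypothetical.

```python
import math


def visual_attention_mass(num_visual, num_text, visual_logit=0.0, text_logit=0.0):
    """Total softmax attention mass on visual tokens for a single query.

    Toy assumption (not from the paper): all visual keys share one logit
    and all text keys share another.
    """
    visual_weight = num_visual * math.exp(visual_logit)
    text_weight = num_text * math.exp(text_logit)
    # The denominator plays the role of the attention partition function.
    # It grows with every generated text token, while the numerator is fixed.
    return visual_weight / (visual_weight + text_weight)


NUM_VISUAL = 256  # hypothetical number of image tokens

for num_text in (0, 256, 1024, 4096, 16384):
    mass = visual_attention_mass(NUM_VISUAL, num_text)
    print(f"text tokens = {num_text:>6}   visual attention mass = {mass:.4f}")
```

Under these assumptions the visual mass is V / (V + T) for V visual and T text tokens, which approaches V / T once T is much larger than V. That matches the inverse dependence on generated length stated in the abstract. Real models have learned, non-uniform logits, so this only conveys the trend.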
