
CoMemo: LVLMs Need Image Context with Image Memory

June 6, 2025
Authors: Shi Liu, Weijie Su, Xizhou Zhu, Wenhai Wang, Jifeng Dai
cs.AI

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) built upon Large Language Models (LLMs) have established aligning visual features with LLM representations as the dominant paradigm. However, inherited LLM architectural designs introduce suboptimal characteristics for multimodal processing. First, LVLMs exhibit a bimodal distribution in attention allocation, leading to progressive neglect of middle visual content as context expands. Second, conventional positional encoding schemes fail to preserve vital 2D structural relationships when processing dynamic high-resolution images. To address these limitations, we propose CoMemo, a dual-path architecture that combines a Context image path with an image Memory path for visual processing, effectively alleviating visual information neglect. Additionally, we introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness while mitigating remote decay in extended sequences. Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo's superior performance compared to conventional LVLM architectures. The project page is available at https://lalbj.github.io/projects/CoMemo/.
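To make the dual-path idea concrete, here is a minimal PyTorch sketch of one decoder layer that combines a causal context path (self-attention over the mixed text-and-image sequence) with a gated cross-attention memory path over a persistent bank of image tokens. The module names, the tanh gating scheme, and the fusion rule are illustrative assumptions inferred from the abstract, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Illustrative sketch of a dual-path layer in the spirit of CoMemo.

    Context path: ordinary causal self-attention over the full sequence.
    Memory path: cross-attention to an image memory bank, so visual
    content cannot be neglected as the context grows. A learnable gate,
    initialized at zero, blends the two paths (assumed design).
    """
    def __init__(self, d_model=1024, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # memory path starts switched off

    def forward(self, hidden, image_memory, causal_mask=None):
        # Context path: standard causal self-attention over text+image tokens.
        ctx, _ = self.self_attn(hidden, hidden, hidden, attn_mask=causal_mask)
        # Memory path: every position attends to the image memory bank.
        mem, _ = self.cross_attn(hidden, image_memory, image_memory)
        # Gated fusion preserves the pretrained LLM behavior at initialization.
        return hidden + ctx + torch.tanh(self.gate) * mem

block = DualPathBlock()
h = torch.randn(1, 32, 1024)    # mixed text+image context tokens
mem = torch.randn(1, 64, 1024)  # image memory tokens
print(block(h, mem).shape)      # torch.Size([1, 32, 1024])
```

Initializing the gate at zero is a common trick (e.g., in gated cross-attention designs) to keep the base LLM's behavior intact before multimodal training; whether CoMemo uses it is an assumption here.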
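Likewise, a minimal sketch of what thumbnail-based positional aggregation might look like: high-resolution tile tokens reuse the RoPE position IDs of the thumbnail patches they spatially overlap, so position values stay compact (mitigating remote decay) while 2D adjacency is preserved. The function, its parameters, and the exact mapping are hypothetical, inferred only from the abstract's description of RoPE-DHR.

```python
import torch

def rope_dhr_position_ids(thumb_h, thumb_w, tiles_per_side, text_offset=0):
    """Hypothetical sketch of thumbnail-based positional aggregation.

    Thumbnail patches get sequential position ids; each high-resolution
    tile token then reuses the id of the thumbnail cell covering its
    location, so ids do not grow with the number of tiles and the 2D
    layout is retained (assumed mapping, for illustration only).
    """
    # Sequential ids for the thumbnail grid (row-major), after any text prefix.
    thumb_ids = torch.arange(thumb_h * thumb_w).view(thumb_h, thumb_w) + text_offset

    # Map each high-resolution grid cell back to its covering thumbnail cell.
    hi_h, hi_w = thumb_h * tiles_per_side, thumb_w * tiles_per_side
    rows = torch.arange(hi_h) * thumb_h // hi_h
    cols = torch.arange(hi_w) * thumb_w // hi_w
    tile_ids = thumb_ids[rows][:, cols]  # (hi_h, hi_w), values shared with thumbnail

    return thumb_ids.flatten(), tile_ids.flatten()

thumb_ids, tile_ids = rope_dhr_position_ids(thumb_h=4, thumb_w=4, tiles_per_side=2)
# Positions stay bounded by the thumbnail grid regardless of resolution:
print(tile_ids.max() == thumb_ids.max())  # tensor(True)
```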