
CoMemo: LVLMs Need Image Context with Image Memory

June 6, 2025
Authors: Shi Liu, Weijie Su, Xizhou Zhu, Wenhai Wang, Jifeng Dai
cs.AI

Abstract

Recent advancements in Large Vision-Language Models built upon Large Language Models have established aligning visual features with LLM representations as the dominant paradigm. However, inherited LLM architectural designs introduce suboptimal characteristics for multimodal processing. First, LVLMs exhibit a bimodal distribution in attention allocation, leading to the progressive neglect of middle visual content as context expands. Second, conventional positional encoding schemes fail to preserve vital 2D structural relationships when processing dynamic high-resolution images. To address these limitations, we propose CoMemo, a dual-path architecture that combines a Context image path with an image Memory path for visual processing, effectively alleviating visual information neglect. Additionally, we introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness while mitigating remote decay in extended sequences. Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo's superior performance compared to conventional LVLM architectures. Project page is available at https://lalbj.github.io/projects/CoMemo/.
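To make the positional-encoding discussion concrete, below is a minimal NumPy sketch of standard rotary position embedding (RoPE), whose attention scores depend only on the relative offset between query and key positions; distant offsets suffer the "remote decay" the abstract mentions. The final two lines illustrate one plausible reading of "thumbnail-based positional aggregation": letting every high-resolution tile reuse its thumbnail token's position so the offset to later text stays small. The function names and this aggregation scheme are illustrative assumptions, not the paper's exact RoPE-DHR formulation.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # Rotate each (x_i, x_{i+half}) pair of vector x by an angle
    # proportional to its integer position pos (standard RoPE).
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

def attention_score(q, k, pos_q, pos_k):
    # With RoPE, this dot product depends only on the offset pos_q - pos_k.
    return float(rope_rotate(q, pos_q) @ rope_rotate(k, pos_k))

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal(64)

# Vanilla scheme: each high-resolution tile token consumes a fresh position,
# so a late text query ends up thousands of steps from early image tokens.
far = attention_score(q, k, pos_q=4096, pos_k=16)

# Assumed aggregation: tiles inherit their thumbnail's position, keeping the
# relative offset to downstream text small regardless of tile count.
near = attention_score(q, k, pos_q=64, pos_k=16)
```

The relative-position property is what makes the aggregation trick meaningful: compressing many tile positions into one shrinks every downstream offset without retraining the encoding itself.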