CoMemo: LVLMs Need Image Context with Image Memory

Abstract

Recent advances in Large Vision-Language Models (LVLMs) built upon Large Language Models (LLMs) have established the alignment of visual features with LLM representations as the dominant paradigm. However, inherited LLM architectural designs introduce suboptimal characteristics for multimodal processing. First, LVLMs exhibit a bimodal distribution in attention allocation, leading to the progressive neglect of middle visual content as context expands. Second, conventional positional encoding schemes fail to preserve vital 2D structural relationships when processing dynamic high-resolution images. To address these limitations, we propose CoMemo, a dual-path architecture that combines a Context image path with an image Memory path for visual processing, effectively alleviating visual information neglect. Additionally, we introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness while mitigating remote decay in extended sequences. Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo's superior performance compared to conventional LVLM architectures. The project page is available at https://lalbj.github.io/projects/CoMemo/.
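The abstract does not spell out how RoPE-DHR computes positions, so the following is only an illustrative sketch of one way thumbnail-based positional aggregation could work. It assumes that an image is represented by a low-resolution thumbnail plus a grid of high-resolution tiles, and that each tile patch reuses the position ID of the thumbnail patch covering the same image region. The function name `thumbnail_position_ids` and all of its parameters are hypothetical and do not come from the paper.

```python
from typing import List


def thumbnail_position_ids(
    thumb_h: int,
    thumb_w: int,
    tile_rows: int,
    tile_cols: int,
    tile_side: int,
    start: int = 0,
) -> List[List[int]]:
    """Assign each high-resolution tile patch the position ID of the
    thumbnail patch that covers the same region (illustrative assumption).

    Thumbnail patches are numbered row-major from `start`.
    Returns one list of position IDs per tile, tiles in row-major order.
    """
    # Size of the full high-resolution patch grid formed by all tiles.
    full_h = tile_rows * tile_side
    full_w = tile_cols * tile_side

    per_tile: List[List[int]] = []
    for tile_r in range(tile_rows):
        for tile_c in range(tile_cols):
            ids: List[int] = []
            for r in range(tile_side):
                for c in range(tile_side):
                    # Global coordinates of this patch in the full grid.
                    global_r = tile_r * tile_side + r
                    global_c = tile_c * tile_side + c
                    # Scale down to the thumbnail grid (integer floor).
                    thumb_r = global_r * thumb_h // full_h
                    thumb_c = global_c * thumb_w // full_w
                    ids.append(start + thumb_r * thumb_w + thumb_c)
            per_tile.append(ids)
    return per_tile


if __name__ == "__main__":
    # A 2x2 thumbnail with a 2x2 grid of tiles, each tile 2x2 patches.
    for tile_ids in thumbnail_position_ids(2, 2, 2, 2, 2):
        print(tile_ids)
```

Under this assumption, the 16 high-resolution patches share only four distinct position IDs. The position range therefore grows with the thumbnail size rather than with the number of tiles, while patches from the same region of the image stay close together in position space.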
