CoMemo: LVLMs Need Image Context with Image Memory

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) built upon Large Language Models (LLMs) have established aligning visual features with LLM representations as the dominant paradigm. However, inherited LLM architectural designs introduce suboptimal characteristics for multimodal processing. First, LVLMs exhibit a bimodal distribution in attention allocation, leading to the progressive neglect of middle visual content as context expands. Second, conventional positional encoding schemes fail to preserve vital 2D structural relationships when processing dynamic high-resolution images. To address these limitations, we propose CoMemo, a dual-path architecture that combines a Context image path with an image Memory path for visual processing, effectively alleviating visual information neglect. Additionally, we introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness while mitigating remote decay in extended sequences. Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo's superior performance compared to conventional LVLM architectures. The project page is available at https://lalbj.github.io/projects/CoMemo/.
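
The idea of thumbnail-based positional aggregation can be illustrated with a minimal sketch. This is one plausible reading of the abstract and not the authors' implementation: it assumes the image is represented by a low-resolution thumbnail plus a grid of high-resolution tiles, and that each tile token reuses the position ID of the thumbnail token covering the same 2D region. The function name `thumbnail_position_ids` and all of its parameters are hypothetical. Under these assumptions, the range of position IDs grows with the thumbnail size rather than with the total number of tile tokens, while tokens that are spatially close share or neighbor each other's IDs.

```python
def thumbnail_position_ids(thumb_h, thumb_w, tile_rows, tile_cols,
                           tile_h, tile_w, start=0):
    """Assign position IDs to thumbnail and tile tokens (illustrative only).

    Assumption: thumbnail tokens get consecutive IDs in row-major order,
    and every high-resolution tile token inherits the ID of the thumbnail
    token that covers the same relative 2D location.
    """
    # Consecutive row-major IDs for the thumbnail grid.
    thumb_ids = [[start + r * thumb_w + c for c in range(thumb_w)]
                 for r in range(thumb_h)]

    # Size of the full high-resolution token grid formed by all tiles.
    full_h = tile_rows * tile_h
    full_w = tile_cols * tile_w

    tile_ids = {}
    for tr in range(tile_rows):
        for tc in range(tile_cols):
            grid = []
            for y in range(tile_h):
                row = []
                for x in range(tile_w):
                    # Global coordinates of this token in the full grid.
                    gy = tr * tile_h + y
                    gx = tc * tile_w + x
                    # Thumbnail cell covering the same relative location.
                    ty = gy * thumb_h // full_h
                    tx = gx * thumb_w // full_w
                    row.append(thumb_ids[ty][tx])
                grid.append(row)
            tile_ids[(tr, tc)] = grid
    return thumb_ids, tile_ids


if __name__ == "__main__":
    thumb, tiles = thumbnail_position_ids(2, 2, 2, 2, 2, 2)
    print(thumb)
    for key in sorted(tiles):
        print(key, tiles[key])
```

In this toy configuration the image contributes 4 thumbnail tokens and 16 tile tokens, yet only 4 distinct position IDs are used, which is the kind of compression that would limit positional distance in long sequences.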
