Mind the Heads: Topological Representation Alignment for Multimodal LLMs

Abstract

Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing methods typically align a fixed layer of the language backbone, overlooking the fine-grained structure of Transformer models. In this work, we propose Head-Wise Representation Alignment (HeRA), a method that enforces cross-modal alignment at the level of individual attention heads. Our approach is grounded in the Platonic Representation Hypothesis, focusing on preserving the topological structure of representations (i.e., their local neighborhood relationships) across modalities. Following the Mutual K-Nearest Neighbor (MKNN) alignment metric, we introduce a contrastive objective that acts as a differentiable proxy for matching local structures. HeRA applies this objective during multimodal training to specific attention heads in the LLM, selected by their alignment score according to the MKNN metric. Counterintuitively, we find that aligning the least aligned heads yields the largest gains. Extensive evaluations across multiple MLLMs and 18 benchmarks demonstrate that HeRA consistently improves performance on challenging vision-centric tasks and serves as an effective regularizer against visual hallucinations by naturally curbing the over-reliance on linguistic priors. Our code is publicly released.
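The head-selection step described above can be illustrated with a minimal sketch. It assumes an MKNN score defined as the average fraction of k nearest neighbors that a sample shares between two feature spaces, with neighbors ranked by cosine similarity over one pooled feature vector per sample. The function names (`knn_indices`, `mknn_score`, `select_least_aligned_heads`) and the array shapes are hypothetical choices for illustration, not the authors' released code, and the differentiable contrastive objective used during training is not shown.

```python
import numpy as np


def knn_indices(feats: np.ndarray, k: int) -> np.ndarray:
    """Return, for each row of `feats` (N, d), the indices of its k nearest
    neighbors under cosine similarity, excluding the row itself."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]


def mknn_score(feats_a: np.ndarray, feats_b: np.ndarray, k: int) -> float:
    """Average fraction of k-nearest neighbors shared by the two spaces.

    Both inputs describe the same N samples; feature widths may differ.
    (Assumed form of the metric; the paper may differ in details.)
    """
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlap = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap)) / k


def select_least_aligned_heads(head_feats, vision_feats, k, num_heads):
    """Score every head against the vision encoder and pick the lowest ones.

    head_feats: (H, N, d_head) per-head features for N samples (hypothetical layout).
    vision_feats: (N, d_vision) vision-encoder features for the same samples.
    Returns (indices of the `num_heads` least aligned heads, all scores).
    """
    scores = np.array([mknn_score(h, vision_feats, k) for h in head_feats])
    return np.argsort(scores)[:num_heads], scores
```

Under this definition two identical feature sets score 1.0 and unrelated ones score close to 0, so sorting in ascending order surfaces the heads whose local neighborhoods agree least with the vision encoder. These are the heads the abstract reports as most beneficial to align.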