Anisotropic Modality Align

Abstract

Training multimodal large language models (MLLMs) has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling multimodal training with unimodal data. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle is the persistent modality gap in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that modality representations already share a compatible dominant semantic geometry. What truly hinders modality interchangeability is not a simple global shift but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we propose the principle of anisotropic modality gap alignment: effective modality alignment should match the target-modality distribution while preserving the semantic structure of the source modality. Guided by this principle, we propose AnisoAlign, an anisotropic geometric correction framework for unpaired modality alignment. The framework leverages the internal geometric prior of the target modality and applies a bounded correction to source-modality representations, thereby constructing substitute representations in the target modality. Experiments confirm its benefits in both geometric diagnostics and text-only MLLM training. Overall, this work recasts the modality gap from an empirical observation into a correctable, structured geometric phenomenon and provides a new representation-alignment perspective for training multimodal models with unimodal data.
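To make the distinction between a "simple global shift" and an "anisotropic residual structure" concrete, the following is a minimal illustrative sketch, not the AnisoAlign method itself. It assumes synthetic, paired embeddings purely for diagnosis (the framework described above is unpaired), and the function name `residual_energy_profile` and all parameters are hypothetical. The sketch removes the mean gap vector and then measures how much of the remaining gap energy falls in the top few directions.

```python
import numpy as np

def residual_energy_profile(src, tgt, k=3):
    """Return (norm of the mean shift, fraction of residual energy in the top-k directions).

    Hypothetical diagnostic: assumes src[i] and tgt[i] are paired embeddings.
    """
    gap = tgt - src                    # per-pair gap vectors
    mean_shift = gap.mean(axis=0)      # the global-shift component
    residual = gap - mean_shift        # what is left once the shift is removed
    # Squared singular values give the residual energy along each principal direction.
    s = np.linalg.svd(residual, compute_uv=False)
    energy = s ** 2
    return float(np.linalg.norm(mean_shift)), float(energy[:k].sum() / energy.sum())

# Synthetic example (an assumption, not data from the paper):
# gap = constant shift + strong residual along k directions + weak isotropic noise.
rng = np.random.default_rng(0)
n, d, k = 500, 32, 3
src = rng.normal(size=(n, d))
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))   # d x k orthonormal columns
coeffs = rng.normal(scale=2.0, size=(n, k))        # large variance along the k directions
noise = rng.normal(scale=0.05, size=(n, d))        # small variance everywhere else
tgt = src + 0.5 + coeffs @ basis.T + noise

shift_norm, top_k_fraction = residual_energy_profile(src, tgt, k=k)
print(f"mean-shift norm: {shift_norm:.2f}, top-{k} residual energy: {top_k_fraction:.3f}")
```

In this toy setup, subtracting the mean shift alone leaves most of the gap untouched, because nearly all of the residual energy sits in the three planted directions. That is the kind of structure the abstract describes as anisotropic.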
