Anisotropic Modality Alignment

Abstract

Training multimodal large language models (MLLMs) has long been constrained by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling multimodal training with only unimodal data. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle is the modality gap that persists in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that representations from different modalities already share a compatible dominant semantic geometry. What actually hinders modality interchangeability is not a simple global shift but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we propose the principle of anisotropic modality gap alignment: effective modality alignment should match the target-modality distribution while preserving the semantic structure of the source modality. Guided by this principle, we propose AnisoAlign, an anisotropic geometric correction framework for unpaired modality alignment. The framework leverages the internal geometric prior of the target modality and applies a bounded correction to source-modality representations, thereby constructing substitute representations in the target modality. Experiments confirm its benefits in both geometric diagnostics and text-only MLLM training. Overall, this work recasts the modality gap from an empirical observation into a correctable, structured geometric phenomenon and offers a new representation-alignment perspective for training multimodal models with unimodal data.
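To make the distinction between a global shift and an anisotropic residual concrete, the following is a minimal NumPy sketch and not the AnisoAlign implementation, which the abstract does not specify. It rests on several assumptions: the function names `gap_diagnostics` and `bounded_correction` are hypothetical, the difference between the two modalities' covariance matrices stands in for the "residual structure", the target's top-k principal directions stand in for its "internal geometric prior", and norm clipping stands in for the "bounded correction". Because the setting is unpaired, only distribution-level statistics are used.

```python
import numpy as np


def gap_diagnostics(src, tgt, k=5):
    """Illustrative diagnostic (an assumption, not the paper's metric).

    Returns the global mean shift between two unpaired embedding sets and
    the share of covariance mismatch carried by the k largest-magnitude
    eigenvalues of the covariance difference.
    """
    shift = tgt.mean(axis=0) - src.mean(axis=0)
    # Covariance mismatch that remains once each modality's mean is removed.
    diff = np.cov(tgt, rowvar=False) - np.cov(src, rowvar=False)
    # diff is symmetric, so its eigenvalues are real.
    mags = np.sort(np.abs(np.linalg.eigvalsh(diff)))[::-1]
    top_k_share = mags[:k].sum() / mags.sum()
    return shift, top_k_share


def bounded_correction(src, tgt, k=5, max_norm=1.0):
    """Illustrative bounded correction (an assumption, not AnisoAlign).

    Moves source embeddings toward the target mean and rescales them only
    along the target's top-k principal directions, then clips each
    per-sample correction to have norm at most max_norm.
    """
    mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
    # eigh returns eigenvalues in ascending order; keep the last k.
    evals, evecs = np.linalg.eigh(np.cov(tgt, rowvar=False))
    dirs = evecs[:, -k:]                      # (d, k) target directions
    tgt_std = np.sqrt(np.maximum(evals[-k:], 0.0))
    proj = (src - mu_s) @ dirs                # (n, k) source coordinates
    src_std = proj.std(axis=0, ddof=1) + 1e-12
    # Correction: variance matching along k directions plus the mean shift.
    delta = (proj * (tgt_std / src_std - 1.0)) @ dirs.T + (mu_t - mu_s)
    # Bound the correction so the source structure is not overwritten.
    norms = np.linalg.norm(delta, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return src + delta * scale
```

A high `top_k_share` for small `k` would indicate that the mismatch is concentrated in a few directions, which is the kind of anisotropy the abstract describes. The clipping step reflects one possible reading of "bounded": it limits how far any source representation may move, at the cost of leaving part of the gap uncorrected when `max_norm` is small.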