LoMo: Deep Vision-Language Fusion via Local Modality Substitution

Abstract

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unaffected. In practice, however, such modality substitution causes dramatic performance degradation. We attribute this "carrier sensitivity" issue to an inherent bias in current training corpora. Across prevalent data sources such as image captioning, VQA, OCR, and web-sourced interleaved corpora, text and images are typically organized into distinct and asymmetric roles, with text serving as linguistic queries and images as visual references. This bias leads VLMs to develop modality-specific preferences for how they acquire information. Consequently, VLMs fail to align representations of semantically equivalent content across textual and visual carriers, making model reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm designed to provide supervision for cross-modal representational invariance between semantically equivalent text and image carriers. LoMo achieves this by reformulating single-modality prompts into seamlessly interleaved multimodal sequences. It dynamically selects target text spans and recasts them as rendered images, thereby preserving the same semantics across "text, visual, text" carriers. Extensive experiments across 13 diverse multimodal benchmarks demonstrate that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across foundation models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.
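
The substitution step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how spans are selected or rendered, so the uniform random word-span choice, the span-length bounds, and every name here (`TextSegment`, `ImageSegment`, `placeholder_render`, `local_modality_substitution`, `recover_text`) are assumptions made for illustration. The renderer is a stand-in that returns a tag rather than drawing pixels; a real pipeline would plug in an actual text-to-image renderer.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    # Stand-in for a rendered image; a real pipeline would hold pixels here.
    payload: object
    # The text the image was rendered from, kept to check semantic equivalence.
    source_text: str


Segment = Union[TextSegment, ImageSegment]


def placeholder_render(text: str) -> str:
    """Hypothetical renderer: returns a tag instead of drawing real pixels."""
    return f"<rendered:{text}>"


def local_modality_substitution(
    prompt: str,
    rng: random.Random,
    min_words: int = 2,
    max_words: int = 6,
    render: Callable[[str], object] = placeholder_render,
) -> List[Segment]:
    """Swap one randomly chosen word span of `prompt` for a rendered stand-in.

    Span selection is uniform over word positions, which is an assumption;
    the abstract only says that spans are selected dynamically.
    """
    words = prompt.split()
    if len(words) < min_words:
        # Too short to substitute locally; keep the prompt as plain text.
        return [TextSegment(prompt)]

    span_len = rng.randint(min_words, min(max_words, len(words)))
    start = rng.randint(0, len(words) - span_len)
    end = start + span_len
    span_text = " ".join(words[start:end])

    segments: List[Segment] = []
    if start > 0:
        segments.append(TextSegment(" ".join(words[:start])))
    segments.append(ImageSegment(render(span_text), span_text))
    if end < len(words):
        segments.append(TextSegment(" ".join(words[end:])))
    return segments


def recover_text(segments: List[Segment]) -> str:
    """Rebuild the original wording to confirm that no content was lost."""
    parts = [s.text if isinstance(s, TextSegment) else s.source_text for s in segments]
    return " ".join(parts)


if __name__ == "__main__":
    prompt = "What is the capital of France and why is it famous"
    for segment in local_modality_substitution(prompt, random.Random(0)):
        print(segment)
```

When the chosen span falls in the middle of the prompt, the result is the "text, visual, text" layout the abstract refers to; a span at either end yields a two-segment sequence instead. Keeping `source_text` alongside the rendered payload makes it straightforward to verify that the interleaved sequence carries the same content as the original prompt.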