LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Abstract

Vision-Language Models (VLMs) have made substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unchanged. In practice, however, such modality substitution causes severe performance degradation. We attribute this "carrier sensitivity" to a bias inherent in current training corpora. In prevalent data sources such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically assigned distinct and asymmetric roles: text serves as the linguistic query and images serve as the visual reference. This bias leads VLMs to develop modality-specific preferences in how they acquire information. Consequently, VLMs fail to align the representations of semantically equivalent content across textual and visual carriers, which makes their reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm that supervises cross-modal representational invariance between semantically equivalent text and image carriers. LoMo does so by reformulating single-modality prompts into seamlessly interleaved image-text sequences. It dynamically selects target text spans and recasts them as rendered images, so that the same semantics are preserved across "text-image-text" carriers. Extensive experiments on 13 diverse multimodal benchmarks show that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across base models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and by 2.82 points on Qwen3.5-9B.
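To make the "text-image-text" reformulation concrete, the following is a minimal sketch of the data transformation, not the authors' implementation. The abstract does not say how spans are selected or rendered, so everything specific here is an assumption made for illustration: the function name `local_modality_substitution`, the uniform random choice of a contiguous word span, the `min_words`/`max_words` bounds, and the `ImageSegment` placeholder that stands in for an actual rasterized image.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    # Placeholder for a rendered image. A real pipeline would rasterize
    # `source_text` with an imaging library and store the pixels instead.
    source_text: str


Segment = Union[TextSegment, ImageSegment]


def local_modality_substitution(
    prompt: str,
    min_words: int = 2,
    max_words: int = 6,
    rng: Optional[random.Random] = None,
) -> List[Segment]:
    """Replace one contiguous word span of `prompt` with an image placeholder.

    Assumption: the span is chosen uniformly at random; the source only says
    that target spans are selected "dynamically".
    """
    rng = rng or random.Random()
    words = prompt.split()
    if len(words) < min_words:
        # Too short to substitute anything; keep the prompt as plain text.
        return [TextSegment(prompt)]

    span_len = rng.randint(min_words, min(max_words, len(words)))
    start = rng.randint(0, len(words) - span_len)
    end = start + span_len

    segments: List[Segment] = []
    if start > 0:
        segments.append(TextSegment(" ".join(words[:start])))
    segments.append(ImageSegment(" ".join(words[start:end])))
    if end < len(words):
        segments.append(TextSegment(" ".join(words[end:])))
    return segments


def carried_text(segments: List[Segment]) -> str:
    """Recover the full semantic content, regardless of carrier."""
    return " ".join(
        s.text if isinstance(s, TextSegment) else s.source_text
        for s in segments
    )


if __name__ == "__main__":
    demo = local_modality_substitution(
        "What is the area of a circle with radius 3 ?", rng=random.Random(0)
    )
    for segment in demo:
        print(segment)
```

The property this sketch aims to illustrate is that the interleaved sequence carries the same content as the original prompt: `carried_text` returns the whitespace-normalized prompt no matter which span was moved to the image carrier, so the training target can stay unchanged.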