LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Abstract

Vision-Language Models (VLMs) have made substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unaffected. In practice, however, such modality substitution causes a sharp drop in performance. We attribute this "carrier sensitivity" to an inherent bias in current training corpora. Across prevalent data sources such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically assigned distinct, asymmetric roles: text serves as the linguistic query and images serve as the visual reference. This bias leads VLMs to develop modality-specific preferences in how they acquire information. Consequently, VLMs fail to align the representations of semantically equivalent content across textual and visual carriers, which makes their reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm designed to supervise cross-modal representational invariance between semantically equivalent text and image carriers. LoMo does this by reformulating single-modality prompts into seamlessly interleaved multimodal sequences: it dynamically selects target text spans and recasts them as rendered images, so that the same semantics are preserved across the "text, visual, text" carriers. Extensive experiments on 13 diverse multimodal benchmarks show that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across base models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and by 2.82 points on Qwen3.5-9B.
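To make the span-to-image reformulation concrete, the following is a minimal sketch of how a text-only prompt might be turned into a "text, visual, text" sequence. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how spans are selected or rendered, so uniform random word-level selection, the names `TextSegment`, `ImageSegment`, `placeholder_render`, `local_modality_substitution`, and `recover_text`, and the `min_words`/`max_words` parameters are all hypothetical. The renderer is a stub that only records the source text; a real pipeline would rasterize the span into an image at that point.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass(frozen=True)
class TextSegment:
    text: str


@dataclass(frozen=True)
class ImageSegment:
    # Stand-in for a rendered image. It keeps the source text so the
    # semantics of the substituted span can be checked in this sketch.
    source_text: str


Segment = Union[TextSegment, ImageSegment]


def placeholder_render(text: str) -> ImageSegment:
    """Hypothetical renderer: a real pipeline would rasterize `text` here."""
    return ImageSegment(source_text=text)


def local_modality_substitution(
    prompt: str,
    rng: random.Random,
    min_words: int = 1,
    max_words: Optional[int] = None,
    render: Callable[[str], ImageSegment] = placeholder_render,
) -> List[Segment]:
    """Replace one contiguous word span of `prompt` with a rendered segment.

    Span selection is uniform at random (an assumption of this sketch).
    """
    words = prompt.split()
    if not words:
        return []
    # Clamp the span-length bounds to the prompt length.
    upper = min(max_words or len(words), len(words))
    lower = max(1, min(min_words, upper))
    span_len = rng.randint(lower, upper)
    start = rng.randint(0, len(words) - span_len)
    end = start + span_len

    segments: List[Segment] = []
    if start > 0:
        segments.append(TextSegment(" ".join(words[:start])))
    segments.append(render(" ".join(words[start:end])))
    if end < len(words):
        segments.append(TextSegment(" ".join(words[end:])))
    return segments


def recover_text(segments: List[Segment]) -> str:
    """Read the sequence back as text to check that semantics are preserved."""
    parts = [
        s.text if isinstance(s, TextSegment) else s.source_text
        for s in segments
    ]
    return " ".join(parts)
```

In this sketch the interleaved sequence always contains exactly one image segment, with text before and/or after it depending on where the span falls. Reading the segments back with `recover_text` yields the original prompt (up to whitespace normalization), which is the property the substitution is meant to preserve across carriers.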