LoMo: Deep Vision-Language Fusion via Local Modality Substitution

Abstract

Vision-Language Models (VLMs) have made substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with a rendered image of the same text should leave model performance essentially unchanged. In practice, however, such modality substitution causes a dramatic drop in performance. We attribute this "carrier sensitivity" to a bias inherent in current training corpora. In prevalent data sources such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images typically occupy distinct, asymmetric roles: text serves as the linguistic query and images serve as the visual reference. This bias leads VLMs to develop distinct, modality-specific preferences for how they acquire information. Consequently, VLMs fail to align their representations of semantically equivalent content across textual and visual carriers, which makes their reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm designed to provide supervision for cross-modal representational invariance between semantically equivalent text and image carriers. LoMo reformulates single-modality prompts into seamlessly interleaved multimodal sequences: it dynamically selects target text spans and recasts them as rendered images, so that the resulting "text, visual, text" sequence carries the same semantics as the original prompt. Extensive experiments on 13 diverse multimodal benchmarks show that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across base models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and by 2.82 points on Qwen3.5-9B.
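
The span substitution described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how spans are selected or how they are rendered, so the uniform random choice of a contiguous word span, the function names `substitute_span` and `render_with_pillow`, the `RenderedSpan` placeholder type, and the `min_words`/`max_words` parameters are all assumptions made here for illustration.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class RenderedSpan:
    """Hypothetical stand-in for an image that displays `text`."""
    text: str


def substitute_span(
    prompt: str,
    rng: random.Random,
    min_words: int = 2,
    max_words: int = 6,
    render: Callable[[str], Any] = RenderedSpan,
) -> List[Any]:
    """Split a text prompt into [prefix text, rendered span, suffix text].

    Assumption: the span is a uniformly random run of words, a simple
    stand-in for whatever "dynamic" selection rule the method uses.
    Whitespace is normalized to single spaces by split/join.
    """
    words = prompt.split()
    if len(words) < min_words + 2:
        return [prompt]  # too short to keep text on both sides of the span
    span_len = rng.randint(min_words, min(max_words, len(words) - 2))
    # Keep at least one word before and one word after the span.
    start = rng.randint(1, len(words) - span_len - 1)
    end = start + span_len
    return [
        " ".join(words[:start]),
        render(" ".join(words[start:end])),
        " ".join(words[end:]),
    ]


def render_with_pillow(text: str):
    """Optional renderer that draws the span onto a white canvas.

    Assumption: Pillow is installed. It is imported lazily so the rest of
    the sketch runs without it. The canvas width is a rough estimate.
    """
    from PIL import Image, ImageDraw

    image = Image.new("RGB", (8 * len(text) + 16, 28), "white")
    ImageDraw.Draw(image).text((8, 8), text, fill="black")
    return image
```

Calling `substitute_span(prompt, random.Random(0))` returns the three segments with a `RenderedSpan` in the middle; passing `render=render_with_pillow` would produce an actual image for the middle segment instead. Joining the prefix, the span text, and the suffix recovers the original wording, which is the property the "text, visual, text" sequence is meant to preserve.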