Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models
October 9, 2025
Authors: Sharut Gupta, Shobhita Sundaram, Chenyu Wang, Stefanie Jegelka, Phillip Isola
cs.AI
Abstract
Traditional multimodal learners find unified representations for tasks like
visual question answering, but rely heavily on paired datasets. However, an
overlooked yet potentially powerful question is: can one leverage auxiliary
unpaired multimodal data to directly enhance representation learning in a
target modality? We introduce UML: Unpaired Multimodal Learner, a
modality-agnostic training paradigm in which a single model alternately
processes inputs from different modalities while sharing parameters across
them. This design exploits the assumption that different modalities are
projections of a shared underlying reality, allowing the model to benefit from
cross-modal structure without requiring explicit pairs. Theoretically, under
linear data-generating assumptions, we show that unpaired auxiliary data can
yield representations strictly more informative about the data-generating
process than unimodal training. Empirically, we show that using unpaired data
from auxiliary modalities -- such as text, audio, or images -- consistently
improves downstream performance across diverse unimodal targets such as image
and audio. Our project page: https://unpaired-multimodal.github.io/
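
The training paradigm described above (one model alternating between modalities while sharing parameters) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the linear encoders and decoders, the reconstruction loss, the synthetic data, the dimensions, and the names `init_modality`, `train_step`, and `shared`. The only point is to show unpaired batches from two modalities updating one shared block in turn.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 4  # width of the shared block (illustrative choice)

def init_modality(dim):
    """Modality-specific linear encoder and decoder (not shared)."""
    return {"enc": rng.normal(0, 0.3, (dim, HIDDEN)),
            "dec": rng.normal(0, 0.3, (HIDDEN, dim))}

def train_step(x, mod, shared, lr=0.01):
    """One SGD step of a linear autoencoder: x -> enc -> shared -> dec."""
    h1 = x @ mod["enc"]
    h2 = h1 @ shared
    err = h2 @ mod["dec"] - x
    loss = float(np.mean(np.sum(err ** 2, axis=1)))
    # Manual backpropagation through the three linear maps.
    g_out = 2.0 * err / len(x)
    g_dec = h2.T @ g_out
    g_h2 = g_out @ mod["dec"].T
    g_shared = h1.T @ g_h2
    g_enc = x.T @ (g_h2 @ shared.T)
    mod["enc"] -= lr * g_enc
    mod["dec"] -= lr * g_dec
    shared -= lr * g_shared  # in-place: every modality updates the same array
    return loss

# Synthetic unpaired data: both modalities are linear views of a 3-d latent,
# but the latent samples are drawn independently, so no (image, text) pairs exist.
data = {
    "image": rng.normal(size=(64, 3)) @ rng.normal(0, 0.5, (3, 8)),
    "text": rng.normal(size=(64, 3)) @ rng.normal(0, 0.5, (3, 6)),
}
modalities = {"image": init_modality(8), "text": init_modality(6)}
shared = np.eye(HIDDEN)
losses = {"image": [], "text": []}

for step in range(200):
    name = "image" if step % 2 == 0 else "text"  # alternate modalities
    batch = data[name][rng.integers(0, 64, size=16)]
    losses[name].append(train_step(batch, modalities[name], shared))
```

In this sketch a step on one modality never touches the other modality's encoder or decoder; the two only interact through `shared`, which is the sole channel by which auxiliary data could influence the target modality's representation.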