Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping
August 17, 2025
Authors: Xuhui Zhan, Tyler Derr
cs.AI
Abstract
Traditional multimodal learning approaches require expensive alignment
pre-training to bridge vision and language modalities, typically projecting
visual features into discrete text token spaces. We challenge both fundamental
assumptions underlying this paradigm by proposing Inverse-LLaVA, a novel
approach that eliminates alignment pre-training entirely while inverting the
conventional mapping direction. Rather than projecting visual features to text
space, our method maps text embeddings into continuous visual representation
space and performs fusion within intermediate transformer layers. Through
selective additive components in attention mechanisms, we enable dynamic
integration of visual and textual representations without requiring massive
image-text alignment datasets. Comprehensive experiments across nine multimodal
benchmarks demonstrate nuanced performance trade-offs: Inverse-LLaVA achieves
notable improvements on reasoning-intensive and cognitive tasks (MM-VET: +0.2%,
VizWiz: +1.8%, ScienceQA: +0.2%, cognitive reasoning: +27.2%), while showing
expected decreases in perception tasks requiring memorized visual-text
associations (celebrity recognition: -49.5%, OCR: -21.3%). These results
provide the first empirical evidence that alignment pre-training is not
necessary for effective multimodal learning, particularly for complex reasoning
tasks. Our work establishes the feasibility of a new paradigm that reduces
computational requirements by 45%, challenges conventional wisdom about
modality fusion, and opens new research directions for efficient multimodal
architectures that preserve modality-specific characteristics. Our project
website with code and additional resources is available at
https://inverse-llava.github.io.
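
The mechanism sketched in the abstract, mapping text embeddings into the visual representation space and fusing them through a selective additive term in attention, can be illustrated with a minimal PyTorch sketch. Everything below is an illustrative assumption rather than the authors' released implementation: the module names (TextToVisionProjector, SelectiveAdditiveAttention), the dimensions (4096-d text embeddings, 1024-d visual features), and the tanh-gated additive fusion are placeholders chosen only to make the idea concrete.

```python
# Minimal sketch of (1) text-to-vision mapping and (2) a selective additive
# attention component, as described in the abstract. All names, dimensions,
# and the gating formulation are assumptions for illustration only.
import torch
import torch.nn as nn


class TextToVisionProjector(nn.Module):
    """Maps text token embeddings into a continuous visual feature space,
    inverting the usual vision-to-text projection direction."""

    def __init__(self, text_dim: int = 4096, vision_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, vision_dim),
            nn.GELU(),
            nn.Linear(vision_dim, vision_dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, num_text_tokens, text_dim)
        return self.proj(text_emb)  # (batch, num_text_tokens, vision_dim)


class SelectiveAdditiveAttention(nn.Module):
    """Adds a gated cross-attention term over visual features to an
    intermediate hidden state, leaving the base stream untouched."""

    def __init__(self, hidden_dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            hidden_dim, num_heads, batch_first=True
        )
        # Per-channel gate initialized at zero so training starts from the
        # text-only behavior (an assumed choice, not taken from the paper).
        self.gate = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, hidden: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim) -- text stream in visual space
        # visual: (batch, num_patches, hidden_dim) -- vision encoder output
        attended, _ = self.cross_attn(query=hidden, key=visual, value=visual)
        return hidden + torch.tanh(self.gate) * attended  # selective additive fusion


if __name__ == "__main__":
    batch, n_text, n_patch = 2, 16, 256
    text_emb = torch.randn(batch, n_text, 4096)       # stand-in for LLM token embeddings
    visual_feats = torch.randn(batch, n_patch, 1024)  # stand-in for ViT patch features

    projector = TextToVisionProjector()
    fusion = SelectiveAdditiveAttention()

    text_in_vision_space = projector(text_emb)          # (2, 16, 1024)
    fused = fusion(text_in_vision_space, visual_feats)  # (2, 16, 1024)
    print(fused.shape)
```

With the gate initialized at zero, the fused output coincides with the projected text stream at the start of fine-tuning, which is one plausible way such a model could be trained directly on downstream data without a separate alignment pre-training stage.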