Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping
August 17, 2025
Authors: Xuhui Zhan, Tyler Derr
cs.AI
Abstract
Traditional multimodal learning approaches require expensive alignment
pre-training to bridge vision and language modalities, typically projecting
visual features into discrete text token spaces. We challenge both fundamental
assumptions underlying this paradigm by proposing Inverse-LLaVA, a novel
approach that eliminates alignment pre-training entirely while inverting the
conventional mapping direction. Rather than projecting visual features to text
space, our method maps text embeddings into continuous visual representation
space and performs fusion within transformer intermediate layers. Through
selective additive components in attention mechanisms, we enable dynamic
integration of visual and textual representations without requiring massive
image-text alignment datasets. Comprehensive experiments across nine multimodal
benchmarks demonstrate nuanced performance trade-offs: Inverse-LLaVA achieves
notable improvements on reasoning-intensive and cognitive tasks (MM-VET: +0.2%,
VizWiz: +1.8%, ScienceQA: +0.2%, cognitive reasoning: +27.2%), while showing
expected decreases in perception tasks requiring memorized visual-text
associations (celebrity recognition: -49.5%, OCR: -21.3%). These results
provide the first empirical evidence that alignment pre-training is not
necessary for effective multimodal learning, particularly for complex reasoning
tasks. Our work establishes the feasibility of a new paradigm that reduces
computational requirements by 45%, challenges conventional wisdom about
modality fusion, and opens new research directions for efficient multimodal
architectures that preserve modality-specific characteristics. Our project
website with code and additional resources is available at
https://inverse-llava.github.io.
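
The abstract describes the core mechanism only at a high level: text embeddings are projected into the continuous visual representation space, and fusion happens through selective additive components inside the transformer's attention layers. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation; the module names (TextToVisionProjector, AdditiveFusionAttention), the two-layer MLP projector, the learned scalar gate, and the cross-attention fusion rule are all illustrative choices rather than details taken from the paper.

    # Illustrative sketch only: projects text embeddings into the visual feature
    # space and fuses them additively inside an attention block. Dimensions,
    # module names, and the gating scheme are assumptions, not the paper's code.
    import torch
    import torch.nn as nn

    class TextToVisionProjector(nn.Module):
        """Maps text-token embeddings into the continuous visual feature space."""
        def __init__(self, text_dim: int, vision_dim: int):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(text_dim, vision_dim),
                nn.GELU(),
                nn.Linear(vision_dim, vision_dim),
            )

        def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
            # (batch, num_text_tokens, text_dim) -> (batch, num_text_tokens, vision_dim)
            return self.proj(text_emb)

    class AdditiveFusionAttention(nn.Module):
        """Visual self-attention with a selective additive text component."""
        def __init__(self, vision_dim: int, num_heads: int = 8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
            self.gate = nn.Parameter(torch.zeros(1))  # learned scalar gate, starts "off"

        def forward(self, vis_tokens: torch.Tensor, projected_text: torch.Tensor) -> torch.Tensor:
            # Standard self-attention over visual tokens.
            out, _ = self.self_attn(vis_tokens, vis_tokens, vis_tokens)
            # Additive component: visual tokens attend to text mapped into visual space.
            txt_ctx, _ = self.cross_attn(vis_tokens, projected_text, projected_text)
            return out + torch.tanh(self.gate) * txt_ctx

    if __name__ == "__main__":
        text_dim, vision_dim = 4096, 1024          # toy sizes for illustration
        projector = TextToVisionProjector(text_dim, vision_dim)
        fusion = AdditiveFusionAttention(vision_dim)
        text_emb = torch.randn(2, 32, text_dim)      # (batch, text tokens, text dim)
        vis_tokens = torch.randn(2, 576, vision_dim) # (batch, visual patches, vision dim)
        fused = fusion(vis_tokens, projector(text_emb))
        print(fused.shape)  # torch.Size([2, 576, 1024])

Initializing the gate at zero keeps the additive text component inactive at the start of training, which is one plausible way to let fusion be learned directly during instruction tuning without a separate alignment pre-training stage.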