Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping
August 17, 2025
Authors: Xuhui Zhan, Tyler Derr
cs.AI
Abstract
Traditional multimodal learning approaches require expensive alignment
pre-training to bridge vision and language modalities, typically projecting
visual features into discrete text token spaces. We challenge both fundamental
assumptions underlying this paradigm by proposing Inverse-LLaVA, a novel
approach that eliminates alignment pre-training entirely while inverting the
conventional mapping direction. Rather than projecting visual features to text
space, our method maps text embeddings into continuous visual representation
space and performs fusion within transformer intermediate layers. Through
selective additive components in attention mechanisms, we enable dynamic
integration of visual and textual representations without requiring massive
image-text alignment datasets. Comprehensive experiments across nine multimodal
benchmarks demonstrate nuanced performance trade-offs: Inverse-LLaVA achieves
notable improvements on reasoning-intensive and cognitive tasks (MM-VET: +0.2%,
VizWiz: +1.8%, ScienceQA: +0.2%, cognitive reasoning: +27.2%), while showing
expected decreases in perception tasks requiring memorized visual-text
associations (celebrity recognition: -49.5%, OCR: -21.3%). These results
provide the first empirical evidence that alignment pre-training is not
necessary for effective multimodal learning, particularly for complex reasoning
tasks. Our work establishes the feasibility of a new paradigm that reduces
computational requirements by 45%, challenges conventional wisdom about
modality fusion, and opens new research directions for efficient multimodal
architectures that preserve modality-specific characteristics. Our project
website with code and additional resources is available at
https://inverse-llava.github.io.
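
The abstract describes the core mechanism only at a high level: text embeddings are projected into the continuous visual representation space, and fusion happens through selective additive components inside the transformer's attention layers. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation; the module names (TextToVisionProjector, AdditiveFusionAttention), the two-layer MLP projector, the learned scalar gate, and the cross-attention fusion rule are all illustrative choices rather than details taken from the paper.

    # Illustrative sketch only: projects text embeddings into the visual feature
    # space and fuses them additively inside an attention block. Dimensions,
    # module names, and the gating scheme are assumptions, not the paper's code.
    import torch
    import torch.nn as nn

    class TextToVisionProjector(nn.Module):
        """Maps text-token embeddings into the continuous visual feature space."""
        def __init__(self, text_dim: int, vision_dim: int):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(text_dim, vision_dim),
                nn.GELU(),
                nn.Linear(vision_dim, vision_dim),
            )

        def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
            # (batch, num_text_tokens, text_dim) -> (batch, num_text_tokens, vision_dim)
            return self.proj(text_emb)

    class AdditiveFusionAttention(nn.Module):
        """Visual self-attention with a selective additive text component."""
        def __init__(self, vision_dim: int, num_heads: int = 8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
            self.gate = nn.Parameter(torch.zeros(1))  # learned scalar gate, starts "off"

        def forward(self, vis_tokens: torch.Tensor, projected_text: torch.Tensor) -> torch.Tensor:
            # Standard self-attention over visual tokens.
            out, _ = self.self_attn(vis_tokens, vis_tokens, vis_tokens)
            # Additive component: visual tokens attend to text mapped into visual space.
            txt_ctx, _ = self.cross_attn(vis_tokens, projected_text, projected_text)
            return out + torch.tanh(self.gate) * txt_ctx

    if __name__ == "__main__":
        text_dim, vision_dim = 4096, 1024          # toy sizes for illustration
        projector = TextToVisionProjector(text_dim, vision_dim)
        fusion = AdditiveFusionAttention(vision_dim)
        text_emb = torch.randn(2, 32, text_dim)      # (batch, text tokens, text dim)
        vis_tokens = torch.randn(2, 576, vision_dim) # (batch, visual patches, vision dim)
        fused = fusion(vis_tokens, projector(text_emb))
        print(fused.shape)  # torch.Size([2, 576, 1024])

Initializing the gate at zero keeps the additive text component inactive at the start of training, which is one plausible way to let fusion be learned directly during instruction tuning without a separate alignment pre-training stage.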