Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping

Abstract

Traditional multimodal learning approaches rely on expensive alignment pre-training to bridge the vision and language modalities, typically by projecting visual features into a discrete text token space. We challenge both assumptions underlying this paradigm, namely that alignment pre-training is required and that visual features must be mapped into text space, by proposing Inverse-LLaVA, an approach that eliminates alignment pre-training entirely and inverts the conventional mapping direction. Rather than projecting visual features into text space, our method maps text embeddings into the continuous visual representation space and performs fusion within the intermediate layers of the transformer. Selective additive components in the attention mechanism enable dynamic integration of visual and textual representations without requiring massive image-text alignment datasets. Comprehensive experiments across nine multimodal benchmarks reveal nuanced performance trade-offs: Inverse-LLaVA achieves improvements on reasoning-intensive and cognitive tasks (MM-VET: +0.2%, VizWiz: +1.8%, ScienceQA: +0.2%, cognitive reasoning: +27.2%), while showing expected decreases on perception tasks that require memorized visual-text associations (celebrity recognition: -49.5%, OCR: -21.3%). These results provide the first empirical evidence that alignment pre-training is not necessary for effective multimodal learning, particularly for complex reasoning tasks. Our work establishes the feasibility of a new paradigm that reduces computational requirements by 45%, challenges conventional wisdom about modality fusion, and opens new research directions for efficient multimodal architectures that preserve modality-specific characteristics. The project website, with code and additional resources, is available at https://inverse-llava.github.io.
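
The inverted mapping described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name `fuse_text_into_vision`, the projection matrix `w_t2v`, and the scalar `gate` are hypothetical, and the gated residual below is only one possible reading of "selective additive components in attention". It shows text embeddings being projected into the visual feature space, attending over visual features there, and receiving the attended result as an additive term.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_text_into_vision(text_emb, vis_feat, w_t2v, gate):
    """Toy text-to-vision fusion (hypothetical, for illustration only).

    text_emb: n_text x d_text text embeddings
    vis_feat: n_vis x d_vis continuous visual features (left unprojected)
    w_t2v:    d_text x d_vis projection from text space to visual space
    gate:     scalar weight on the additive visual term (an assumption)
    """
    # Step 1: map text embeddings INTO the visual representation space.
    text_in_vis = matmul(text_emb, w_t2v)

    # Step 2: scaled dot-product attention from text queries to visual features.
    d_vis = len(vis_feat[0])
    vis_t = [list(col) for col in zip(*vis_feat)]
    scores = matmul(text_in_vis, vis_t)
    weights = [softmax([s / math.sqrt(d_vis) for s in row]) for row in scores]
    attended = matmul(weights, vis_feat)

    # Step 3: add the attended visual signal to the projected text, scaled by the gate.
    return [
        [t + gate * a for t, a in zip(t_row, a_row)]
        for t_row, a_row in zip(text_in_vis, attended)
    ]

# Example inputs: 2 text tokens (d_text = 3), 4 visual patches (d_vis = 2).
text_emb = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
w_t2v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
vis_feat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
fused = fuse_text_into_vision(text_emb, vis_feat, w_t2v, gate=1.0)
```

With `gate=0.0` the function returns the projected text unchanged, so the additive term can be switched on per layer or per token; in a trained model the gate and projection would presumably be learned rather than fixed as they are here.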

