Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping

Abstract

Traditional multimodal learning approaches require expensive alignment pre-training to bridge vision and language modalities, typically projecting visual features into discrete text token spaces. We challenge both fundamental assumptions underlying this paradigm by proposing Inverse-LLaVA, a novel approach that eliminates alignment pre-training entirely while inverting the conventional mapping direction. Rather than projecting visual features into text space, our method maps text embeddings into a continuous visual representation space and performs fusion within transformer intermediate layers. Through selective additive components in attention mechanisms, we enable dynamic integration of visual and textual representations without requiring massive image-text alignment datasets. Comprehensive experiments across nine multimodal benchmarks demonstrate nuanced performance trade-offs: Inverse-LLaVA achieves notable improvements on reasoning-intensive and cognitive tasks (MM-VET: +0.2%, VizWiz: +1.8%, ScienceQA: +0.2%, cognitive reasoning: +27.2%), while showing expected decreases in perception tasks requiring memorized visual-text associations (celebrity recognition: -49.5%, OCR: -21.3%). These results provide the first empirical evidence that alignment pre-training is not necessary for effective multimodal learning, particularly for complex reasoning tasks. Our work establishes the feasibility of a new paradigm that reduces computational requirements by 45%, challenges conventional wisdom about modality fusion, and opens new research directions for efficient multimodal architectures that preserve modality-specific characteristics. Our project website with code and additional resources is available at https://inverse-llava.github.io.
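
The abstract does not give layer-level details, so the following is only a minimal sketch of what "mapping text embeddings into visual space" and "an additive component in attention" could look like. Everything in it is an assumption made for illustration: the function names, the single-head layout, the projection matrices `w_proj` and `w_out`, and the scalar `gate` are hypothetical and are not taken from the Inverse-LLaVA code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def text_self_attention(hidden, w_q, w_k, w_v):
    """Plain single-head self-attention over text hidden states.

    hidden: (n_text, d_text); w_q, w_k, w_v: (d_text, d_text).
    """
    q, k, v = hidden @ w_q, hidden @ w_k, hidden @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def fused_attention(hidden, vision_feats, w_q, w_k, w_v, w_proj, w_out, gate=1.0):
    """Self-attention plus an additive cross-modal term (hypothetical sketch).

    The text states are mapped INTO the vision space via w_proj (text-to-vision,
    the inverse of the usual direction), attend over the continuous vision
    features there, and the result is mapped back with w_out and added.

    vision_feats: (n_img, d_vis); w_proj: (d_text, d_vis); w_out: (d_vis, d_text).
    """
    base = text_self_attention(hidden, w_q, w_k, w_v)

    # Text-to-vision mapping: queries now live in the vision feature space.
    q_vis = hidden @ w_proj                                  # (n_text, d_vis)
    scores = q_vis @ vision_feats.T / np.sqrt(q_vis.shape[-1])
    vis_ctx = softmax(scores) @ vision_feats                 # (n_text, d_vis)

    # Additive component: gate=0.0 recovers ordinary text-only attention.
    return base + gate * (vis_ctx @ w_out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_text, n_img, d_text, d_vis = 5, 7, 16, 32
    hidden = rng.normal(size=(n_text, d_text))
    vision = rng.normal(size=(n_img, d_vis))
    w_q, w_k, w_v = (rng.normal(size=(d_text, d_text)) * 0.1 for _ in range(3))
    w_proj = rng.normal(size=(d_text, d_vis)) * 0.1
    w_out = rng.normal(size=(d_vis, d_text)) * 0.1
    out = fused_attention(hidden, vision, w_q, w_k, w_v, w_proj, w_out)
    print(out.shape)
```

In this sketch the vision features are never projected into the text token space; only the text side is mapped. That is the design choice the abstract describes, though the actual fusion layers may differ substantially.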
