Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping

Abstract

Traditional multimodal learning approaches require expensive alignment pre-training to bridge the vision and language modalities, and typically project visual features into the discrete text token space. We challenge both assumptions with Inverse-LLaVA, an approach that eliminates alignment pre-training entirely and inverts the conventional mapping direction. Rather than projecting visual features into text space, our method maps text embeddings into the continuous visual representation space and performs fusion within intermediate transformer layers. Selective additive components in the attention mechanism enable dynamic integration of visual and textual representations without massive image-text alignment datasets. Comprehensive experiments across nine multimodal benchmarks reveal nuanced performance trade-offs: Inverse-LLaVA achieves notable improvements on reasoning-intensive and cognitive tasks (MM-VET: +0.2%, VizWiz: +1.8%, ScienceQA: +0.2%, cognitive reasoning: +27.2%), while showing expected decreases on perception tasks that require memorized visual-text associations (celebrity recognition: -49.5%, OCR: -21.3%). These results provide the first empirical evidence that alignment pre-training is not necessary for effective multimodal learning, particularly for complex reasoning tasks. Our work establishes the feasibility of a new paradigm that reduces computational requirements by 45%, challenges conventional wisdom about modality fusion, and opens new research directions for efficient multimodal architectures that preserve modality-specific characteristics. Our project website, with code and additional resources, is available at https://inverse-llava.github.io.
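To make the inverted mapping direction concrete, the following is a minimal NumPy sketch of one way a text-to-vision projection with a gated additive attention term could look. It is not the authors' implementation: the single-head layout, the scalar `gate`, and all weight names (`w_t2v`, `w_v2t`, and so on) are assumptions introduced here for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(h, w_q, w_k, w_v):
    """Plain single-head self-attention over text hidden states."""
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def fused_attention(h, vis, p, gate=1.0):
    """Self-attention plus a gated additive visual term (illustrative only).

    h:   (n_text, d_text) text hidden states at an intermediate layer
    vis: (n_vis, d_vis)   continuous visual features, kept in their own space
    p:   dict of weight matrices; every key name here is hypothetical
    """
    base = self_attention(h, p["w_q"], p["w_k"], p["w_v"])
    # Text-to-vision mapping: project text states into the visual space.
    h_vis = h @ p["w_t2v"]                       # (n_text, d_vis)
    # Each text position attends over the visual features in that space.
    scores = h_vis @ vis.T / np.sqrt(vis.shape[-1])
    vis_ctx = softmax(scores) @ vis              # (n_text, d_vis)
    # Additive component: map the visual context back and add it, scaled by gate.
    return base + gate * (vis_ctx @ p["w_v2t"])


# Toy dimensions and random weights, chosen only to exercise the shapes.
rng = np.random.default_rng(0)
n_text, n_vis, d_text, d_vis = 4, 6, 8, 16
params = {
    "w_q": rng.normal(size=(d_text, d_text)),
    "w_k": rng.normal(size=(d_text, d_text)),
    "w_v": rng.normal(size=(d_text, d_text)),
    "w_t2v": rng.normal(size=(d_text, d_vis)),
    "w_v2t": rng.normal(size=(d_vis, d_text)),
}
h = rng.normal(size=(n_text, d_text))
vis = rng.normal(size=(n_vis, d_vis))

out = fused_attention(h, vis, params, gate=1.0)
print(out.shape)  # (4, 8)
```

In this sketch, setting `gate=0.0` recovers plain text self-attention, which is one simple reading of "selective": the visual term can be switched on only in the layers where fusion is wanted. The visual features are never projected into the text token space; only the text states cross over.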
