FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

Abstract

We introduce FUSION, a family of multimodal large language models (MLLMs) built on a full vision-language alignment and integration paradigm. Unlike existing methods, which rely primarily on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, which incorporates textual information into vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding, which recursively aggregates visual features conditioned on textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop a Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method, prioritizing high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales, 3B and 8B, and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks, and it continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that FUSION outperforms LLaVA-NeXT on over half of the benchmarks under the same configuration without dynamic resolution, highlighting the effectiveness of our approach. We release our code, model weights, and dataset at https://github.com/starriver030515/FUSION
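
To make the idea of text-guided vision encoding more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the mechanism, so the sketch assumes one plausible reading: text embeddings are projected into the vision feature space, concatenated with image patch tokens, and mixed by a single self-attention layer, after which only the patch rows are kept. The function name `text_guided_encode`, all weight matrices, and all dimensions are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def text_guided_encode(patches, text, w_text, w_q, w_k, w_v):
    """One self-attention layer over [patches; projected text].

    Assumed design (not taken from the paper): the text tokens join the
    patch sequence so every patch can attend to the question, and only
    the patch rows are returned as the visual representation.
    """
    text_proj = text @ w_text                              # (T, d)
    tokens = np.concatenate([patches, text_proj], axis=0)  # (P + T, d)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))         # (P + T, P + T)
    mixed = tokens + attn @ v                              # residual connection
    return mixed[: patches.shape[0]]                       # keep patch rows only


rng = np.random.default_rng(0)
P, T, d_text, d = 6, 3, 5, 8  # hypothetical sizes
patches = rng.normal(size=(P, d))
text_a = rng.normal(size=(T, d_text))
text_b = rng.normal(size=(T, d_text))
w_text = rng.normal(size=(d_text, d)) / np.sqrt(d_text)
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out_a = text_guided_encode(patches, text_a, w_text, w_q, w_k, w_v)
out_b = text_guided_encode(patches, text_b, w_text, w_q, w_k, w_v)
```

In this sketch the same image patches yield different visual features for different questions (`out_a` versus `out_b`). That contrasts with a late-fusion pipeline, where the vision encoder output is independent of the text.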
