FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

Abstract

We introduce FUSION, a family of multimodal large language models (MLLMs) built on a full vision-language alignment and integration paradigm. Unlike existing methods, which rely primarily on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, which incorporates textual information into vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding, which recursively aggregates visual features conditioned on the textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop a Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method that prioritizes high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales, 3B and 8B, and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks, and it continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that FUSION outperforms LLaVA-NeXT on over half of the benchmarks under the same configuration without dynamic resolution, highlighting the effectiveness of our approach. We release our code, model weights, and dataset at https://github.com/starriver030515/FUSION

