MoAI: Mixture of All Intelligence for Large Language and Vision Models

Abstract

The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to handle vast amounts of vision-language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. We therefore present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After the outputs of the external CV models are verbalized, MoAI-Compressor aligns and condenses them so that relevant auxiliary visual information can be used efficiently for VL tasks. MoAI-Mixer then blends three types of intelligence, (1) visual features, (2) auxiliary features from the external CV models, and (3) language features, using the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs on numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR, without enlarging the model size or curating extra visual instruction tuning datasets.
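The blending of three feature sources described in the abstract can be illustrated with a toy gated mixture. This is only a minimal sketch of the general Mixture of Experts idea (softmax gate weights applied to several sources), not the MoAI-Mixer itself: the function names, the dot-product gate, and all numeric values are assumptions made for illustration, and the abstract does not specify how the actual module computes its gates.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equally sized lists."""
    return sum(x * y for x, y in zip(a, b))


def mix_features(visual, auxiliary, language, gate_weights):
    """Blend three equally sized feature vectors with softmax gate weights.

    gate_weights holds one hypothetical weight vector per source; each gate
    score is the dot product of a source with its own weight vector.
    """
    sources = [visual, auxiliary, language]
    scores = [dot(src, w) for src, w in zip(sources, gate_weights)]
    gates = softmax(scores)
    # Weighted sum of the three sources, computed per feature dimension.
    mixed = [
        sum(g * src[i] for g, src in zip(gates, sources))
        for i in range(len(visual))
    ]
    return mixed, gates


# Toy 2-dimensional features; all values are made up for illustration.
visual = [1.0, 0.0]
auxiliary = [0.0, 1.0]
language = [1.0, 1.0]
# Zero gate weights give equal scores, so each source gets the same gate.
gate_weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

mixed, gates = mix_features(visual, auxiliary, language, gate_weights)
```

In a trained model the gate weights would be learned, so the contribution of each source would vary with the input instead of being uniform as in this toy setup.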
