MoAI: Mixture of All Intelligence for Large Language and Vision Models
March 12, 2024
Authors: Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro
cs.AI
Abstract
The rise of large language models (LLMs) and instruction tuning has led to
the current trend of instruction-tuned large language and vision models
(LLVMs). This trend involves either meticulously curating numerous instruction
tuning datasets tailored to specific objectives or enlarging LLVMs to manage
vast amounts of vision language (VL) data. However, current LLVMs have
disregarded the detailed and comprehensive real-world scene understanding
available from specialized computer vision (CV) models in visual perception
tasks such as segmentation, detection, scene graph generation (SGG), and
optical character recognition (OCR). Instead, the existing LLVMs rely mainly on
the large capacity and emergent capabilities of their LLM backbones. Therefore,
we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages
auxiliary visual information obtained from the outputs of external
segmentation, detection, SGG, and OCR models. MoAI operates through two newly
introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the
outputs of the external CV models, the MoAI-Compressor aligns and condenses
them to efficiently use relevant auxiliary visual information for VL tasks.
MoAI-Mixer then blends three types of intelligence: (1) visual features, (2)
auxiliary features from the external CV models, and (3) language features by
utilizing the concept of Mixture of Experts. Through this integration, MoAI
significantly outperforms both open-source and closed-source LLVMs in numerous
zero-shot VL tasks, particularly those related to real-world scene
understanding such as object existence, positions, relations, and OCR, without
enlarging the model size or curating extra visual instruction tuning datasets.
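The pipeline described above (verbalize external CV outputs, then blend visual, auxiliary, and language features with a Mixture-of-Experts-style gate) can be sketched as follows. This is a toy illustration under assumed shapes and templates, not the authors' MoAI-Compressor/MoAI-Mixer implementation; the verbalization template, expert layout, and the choice to gate on language features are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def verbalize_detections(detections):
    """Turn detector output into text, in the spirit of the paper's
    'verbalizing' step. The exact template is an assumption."""
    return "; ".join(f"{label} at box ({x1}, {y1}, {x2}, {y2})"
                     for label, (x1, y1, x2, y2) in detections)

class MoEMixerSketch:
    """Toy mixture-of-experts blender over three 'intelligence' sources:
    visual, auxiliary (from external CV models), and language features.
    Random weights; illustrative only, not the MoAI-Mixer architecture."""
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # One linear expert per intelligence source.
        self.experts = [rng.standard_normal((dim, dim)) / np.sqrt(dim)
                        for _ in range(3)]
        # Gate producing per-token weights over the three experts
        # (gating on language features is an assumption of this sketch).
        self.gate = rng.standard_normal((dim, 3)) / np.sqrt(dim)

    def __call__(self, visual, auxiliary, language):
        # Each expert transforms its own source: shape (3, tokens, dim).
        expert_out = np.stack([src @ w for src, w in
                               zip((visual, auxiliary, language), self.experts)])
        # Per-token gating weights: shape (tokens, 3), rows sum to 1.
        weights = softmax(language @ self.gate)
        # Convex combination of the three expert outputs per token.
        return np.einsum("etd,te->td", expert_out, weights)

# Example: verbalize a detection, then mix three feature streams.
text = verbalize_detections([("cat", (10, 20, 110, 220))])
mixer = MoEMixerSketch(dim=8)
rng = np.random.default_rng(1)
visual, auxiliary, language = (rng.standard_normal((4, 8)) for _ in range(3))
mixed = mixer(visual, auxiliary, language)  # shape (4, 8)
```

The gating step is what makes this a soft mixture rather than plain feature concatenation: each token can weight the visual, auxiliary, and language experts differently.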