

MoAI: Mixture of All Intelligence for Large Language and Vision Models

March 12, 2024
Authors: Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro
cs.AI

Abstract

The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, the existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the outputs of the external CV models, the MoAI-Compressor aligns and condenses them to efficiently use relevant auxiliary visual information for VL tasks. MoAI-Mixer then blends three types of intelligence: (1) visual features, (2) auxiliary features from the external CV models, and (3) language features, by utilizing the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR, without enlarging the model size or curating extra visual instruction tuning datasets.
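The core idea of MoAI-Mixer, blending three feature streams with a Mixture-of-Experts-style gate, can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the dimensions, the single linear gate `W_gate`, and routing conditioned on language tokens are all hypothetical stand-ins for the learned modules the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions; the paper does not specify these here.
d = 8          # shared feature dimension after alignment
n_tokens = 4   # tokens per stream

# Three "intelligence" streams, mirroring the abstract:
visual = rng.normal(size=(n_tokens, d))      # (1) visual features
auxiliary = rng.normal(size=(n_tokens, d))   # (2) verbalized external-CV features
language = rng.normal(size=(n_tokens, d))    # (3) language features

experts = np.stack([visual, auxiliary, language])  # (3, n_tokens, d)

# Stand-in gate: per-token logits over the 3 expert streams,
# routed here from the language tokens (an assumption, not the paper's design).
W_gate = rng.normal(size=(d, 3))
weights = softmax(language @ W_gate, axis=-1)  # (n_tokens, 3), rows sum to 1

# Convex per-token blend of the three expert streams.
mixed = np.einsum("te,etd->td", weights, experts)

print(mixed.shape)  # (4, 8)
```

Because the gate weights are a softmax, each output token is a convex combination of the three streams, so the mixer can smoothly interpolate between relying on perception, auxiliary CV outputs, and language context.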

