MoAI: Mixture of All Intelligence for Large Language and Vision Models

Abstract

The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After the outputs of the external CV models are verbalized, MoAI-Compressor aligns and condenses them so that the auxiliary visual information relevant to VL tasks can be used efficiently. MoAI-Mixer then blends three types of intelligence: (1) visual features, (2) auxiliary features from the external CV models, and (3) language features, using the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot VL tasks, particularly those related to real-world scene understanding, such as object existence, positions, relations, and OCR, without enlarging the model size or curating extra visual instruction tuning datasets.
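The two ideas named in the abstract, verbalizing external CV outputs and blending three feature types with a gate, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the verbalization template or the gating mechanism, so the detection dictionary format, the function names `verbalize_detections` and `mix_features`, and the plain softmax-weighted sum used here are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def verbalize_detections(detections: Sequence[Dict]) -> str:
    """Turn detector outputs into a text description.

    The input format (dicts with "label" and "box") and the sentence
    template are hypothetical; the abstract does not define them.
    """
    if not detections:
        return "No objects detected."
    parts = []
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        parts.append(f"{det['label']} at ({x1}, {y1}, {x2}, {y2})")
    return "Detected objects: " + "; ".join(parts) + "."


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of gate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_features(
    visual: Sequence[float],
    auxiliary: Sequence[float],
    language: Sequence[float],
    gate_scores: Sequence[float],
) -> List[float]:
    """Blend three equal-length feature vectors with softmax gate weights.

    A stand-in for expert mixing (assumption): one scalar weight per
    feature source, applied as a weighted sum.
    """
    weights = softmax(gate_scores)
    sources = (visual, auxiliary, language)
    return [
        sum(w * src[i] for w, src in zip(weights, sources))
        for i in range(len(visual))
    ]


if __name__ == "__main__":
    text = verbalize_detections([{"label": "dog", "box": (10, 20, 110, 220)}])
    print(text)
    # Equal gate scores give each source a weight of 1/3.
    print(mix_features([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0, 0.0]))
```

In a real model the gate scores would be produced by a learned network and the features would be token-level tensors rather than short lists; the sketch only shows the data flow from verbalized CV output to a gated blend.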
