Multimodal Foundation Models: From Specialists to General-Purpose Assistants

Abstract

This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, grouped into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, covering two topics: methods of learning vision backbones for visual understanding, and text-to-image generation. (ii) We then present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, covering three topics: unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs. The target audience of this paper is researchers, graduate students, and professionals in the computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances in multimodal foundation models.
