Multimodal Foundation Models: From Specialists to General-Purpose Assistants

Abstract

This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, covering two topics: methods of learning vision backbones for visual understanding, and text-to-image generation. (ii) We then present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, covering three topics: unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs. The target audience of the paper is researchers, graduate students, and professionals in the computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances in multimodal foundation models.
