Multimodal Foundation Models: From Specialists to General-Purpose Assistants
September 18, 2023
Authors: Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao
cs.AI
Abstract
This paper presents a comprehensive survey of the taxonomy and evolution of
multimodal foundation models that demonstrate vision and vision-language
capabilities, focusing on the transition from specialist models to
general-purpose assistants. The research landscape encompasses five core
topics, categorized into two classes. (i) We start with a survey of
well-established research areas: multimodal foundation models pre-trained for
specific purposes, including two topics -- methods of learning vision backbones
for visual understanding and text-to-image generation. (ii) Then, we present
recent advances in exploratory, open research areas: multimodal foundation
models that aim to play the role of general-purpose assistants, including three
topics -- unified vision models inspired by large language models (LLMs),
end-to-end training of multimodal LLMs, and chaining multimodal tools with
LLMs. The target audiences of the paper are researchers, graduate students, and
professionals in computer vision and vision-language multimodal communities who
are eager to learn the basics and recent advances in multimodal foundation
models.