Multimodal Foundation Models: From Specialists to General-Purpose Assistants
September 18, 2023
Authors: Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao
cs.AI
Abstract
This paper presents a comprehensive survey of the taxonomy and evolution of
multimodal foundation models that demonstrate vision and vision-language
capabilities, focusing on the transition from specialist models to
general-purpose assistants. The research landscape encompasses five core
topics, categorized into two classes. (i) We start with a survey of
well-established research areas: multimodal foundation models pre-trained for
specific purposes, including two topics -- methods of learning vision backbones
for visual understanding and text-to-image generation. (ii) Then, we present
recent advances in exploratory, open research areas: multimodal foundation
models that aim to play the role of general-purpose assistants, including three
topics -- unified vision models inspired by large language models (LLMs),
end-to-end training of multimodal LLMs, and chaining multimodal tools with
LLMs. The target audiences of the paper are researchers, graduate students, and
professionals in computer vision and vision-language multimodal communities who
are eager to learn the basics and recent advances in multimodal foundation
models.