Multimodal Foundation Models: From Specialists to General-Purpose Assistants
September 18, 2023
Authors: Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao
cs.AI
Abstract
This paper presents a comprehensive survey of the taxonomy and evolution of
multimodal foundation models that demonstrate vision and vision-language
capabilities, focusing on the transition from specialist models to
general-purpose assistants. The research landscape encompasses five core
topics, categorized into two classes. (i) We start with a survey of
well-established research areas: multimodal foundation models pre-trained for
specific purposes, including two topics -- methods of learning vision backbones
for visual understanding and text-to-image generation. (ii) Then, we present
recent advances in exploratory, open research areas: multimodal foundation
models that aim to play the role of general-purpose assistants, including three
topics -- unified vision models inspired by large language models (LLMs),
end-to-end training of multimodal LLMs, and chaining multimodal tools with
LLMs. The target audiences of the paper are researchers, graduate students, and
professionals in computer vision and vision-language multimodal communities who
are eager to learn the basics and recent advances in multimodal foundation
models.