Multimodal Foundation Models: From Specialists to General-Purpose Assistants
September 18, 2023
Authors: Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao
cs.AI
Abstract
This paper presents a comprehensive survey of the taxonomy and evolution of
multimodal foundation models that demonstrate vision and vision-language
capabilities, focusing on the transition from specialist models to
general-purpose assistants. The research landscape encompasses five core
topics, categorized into two classes. (i) We start with a survey of
well-established research areas: multimodal foundation models pre-trained for
specific purposes, including two topics -- methods of learning vision backbones
for visual understanding and text-to-image generation. (ii) Then, we present
recent advances in exploratory, open research areas: multimodal foundation
models that aim to play the role of general-purpose assistants, including three
topics -- unified vision models inspired by large language models (LLMs),
end-to-end training of multimodal LLMs, and chaining multimodal tools with
LLMs. The target audiences of the paper are researchers, graduate students, and
professionals in computer vision and vision-language multimodal communities who
are eager to learn the basics and recent advances in multimodal foundation
models.