DreamLLM: Synergistic Multimodal Comprehension and Creation
September 20, 2023
Authors: Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, Li Yi
cs.AI
Abstract
This paper presents DreamLLM, a learning framework that is the first to achieve versatile Multimodal Large Language Models (MLLMs) empowered by the frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first focuses on generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors such as CLIP, yielding a more thorough multimodal understanding. Second, DreamLLM fosters the generation of raw, interleaved documents, modeling both text and image content along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content. Comprehensive experiments highlight DreamLLM's superior performance as a zero-shot multimodal generalist, benefiting from the enhanced learning synergy.
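As a rough illustrative sketch of the interleaved free-form generation described in the abstract (not the authors' implementation; the placeholder token "<image>", and the toy samplers generate_step and decode_image, are assumptions for illustration only), a decoder could emit text tokens and, at designated positions, hand control to an image decoder conditioned on the running multimodal context:

# Minimal toy sketch of interleaved text/image decoding (illustrative only;
# the "<image>" token and both stand-in components are assumptions, not
# DreamLLM's actual interface).
import random

VOCAB = ["A", "sunny", "beach", ".", "<image>", "<eos>"]

def generate_step(context):
    # Stand-in for an MLLM's next-token sampler.
    return random.choice(VOCAB)

def decode_image(context):
    # Stand-in for a learned image decoder (e.g., a diffusion model)
    # conditioned on the running multimodal context.
    return f"[image conditioned on: {' '.join(context[-5:])}]"

def generate_interleaved(max_steps=20):
    context, outputs = [], []
    for _ in range(max_steps):
        token = generate_step(context)
        if token == "<eos>":
            break
        if token == "<image>":
            # Hand off to the image decoder, then continue text generation
            # with the produced image position recorded in the context.
            outputs.append(decode_image(context))
            context.append("<image>")
        else:
            outputs.append(token)
            context.append(token)
    return outputs

if __name__ == "__main__":
    print(generate_interleaved())

The point of the sketch is only that text and images are produced within a single autoregressive pass over one shared context, which is what allows conditional, marginal, and joint multimodal distributions to be learned from raw interleaved documents.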