DreamLLM: Synergistic Multimodal Comprehension and Creation
September 20, 2023
Authors: Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, Li Yi
cs.AI
Abstract
This paper presents DreamLLM, a learning framework that, for the first time, achieves versatile Multimodal Large Language Models (MLLMs) empowered with the frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first focuses on the generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors such as CLIP, yielding a more thorough multimodal understanding. Second, DreamLLM fosters the generation of raw, interleaved documents, modeling both text and image contents along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content. Comprehensive experiments highlight DreamLLM's superior performance as a zero-shot multimodal generalist, benefiting from the enhanced learning synergy.
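
The abstract does not spell out the decoding mechanics, so the sketch below is an illustration only of how a free-form interleaved text-and-image generation loop could be wired: the language model emits text tokens until a special <dream> token appears, learnable query embeddings are then appended, their hidden states condition an image decoder (a stand-in for a diffusion model), and the generated image is embedded back into the sequence so decoding continues in the raw multimodal space without an external feature extractor. All class names, token IDs, tensor sizes, and toy modules are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of interleaved text-image decoding.
# Assumed pieces: a <dream> token, learnable "dream" query embeddings, an
# image decoder conditioned on the query hidden states, and an image encoder
# that feeds generated pixels back into the LM as embeddings.

import torch
import torch.nn as nn

VOCAB_SIZE, HIDDEN, NUM_QUERIES, DREAM_ID, EOS_ID = 1000, 256, 4, 999, 998


class ToyBackbone(nn.Module):
    """Stand-in for the causal LLM backbone."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, inputs_embeds):
        # Causal mask: each position attends only to the past.
        mask = nn.Transformer.generate_square_subsequent_mask(inputs_embeds.size(1))
        hidden = self.encoder(inputs_embeds, mask=mask)
        return hidden, self.lm_head(hidden)


class ToyImageDecoder(nn.Module):
    """Stand-in for an image generator (e.g., a diffusion decoder)
    conditioned on the LM hidden states at the dream-query positions."""
    def __init__(self):
        super().__init__()
        self.to_pixels = nn.Linear(HIDDEN * NUM_QUERIES, 3 * 32 * 32)

    def forward(self, query_states):
        flat = query_states.reshape(query_states.size(0), -1)
        return self.to_pixels(flat).view(-1, 3, 32, 32)


class ToyImageEncoder(nn.Module):
    """Maps a generated image back into LM input embeddings so decoding
    continues conditioned on it (raw-space modeling, no CLIP features)."""
    def __init__(self):
        super().__init__()
        self.to_tokens = nn.Linear(3 * 32 * 32, HIDDEN * NUM_QUERIES)

    def forward(self, image):
        flat = self.to_tokens(image.flatten(1))
        return flat.view(-1, NUM_QUERIES, HIDDEN)


@torch.no_grad()
def generate_interleaved(backbone, img_decoder, img_encoder, prompt_ids, max_steps=16):
    """Greedy decoding that emits text tokens and, whenever <dream> is
    sampled, inserts the dream queries, renders an image, and feeds it back."""
    dream_queries = torch.randn(1, NUM_QUERIES, HIDDEN)  # learnable in training
    inputs = backbone.embed(prompt_ids)                  # (1, T, HIDDEN)
    outputs = []
    for _ in range(max_steps):
        hidden, logits = backbone(inputs)
        next_id = logits[:, -1].argmax(dim=-1)           # greedy next token
        if next_id.item() == DREAM_ID:
            # Append dream queries, re-run, condition the image decoder on
            # the hidden states at the query positions.
            inputs = torch.cat([inputs, dream_queries], dim=1)
            hidden, _ = backbone(inputs)
            image = img_decoder(hidden[:, -NUM_QUERIES:])
            outputs.append(("image", image))
            # Feed the image back as embeddings for subsequent text.
            inputs = torch.cat([inputs, img_encoder(image)], dim=1)
        elif next_id.item() == EOS_ID:
            break
        else:
            outputs.append(("text", next_id.item()))
            inputs = torch.cat([inputs, backbone.embed(next_id)[:, None]], dim=1)
    return outputs


if __name__ == "__main__":
    torch.manual_seed(0)
    result = generate_interleaved(ToyBackbone(), ToyImageDecoder(), ToyImageEncoder(),
                                  prompt_ids=torch.randint(0, 900, (1, 5)))
    print([kind for kind, _ in result])
```

In a real system the toy backbone, image decoder, and image encoder would be a pretrained MLLM, a diffusion decoder, and a raw-pixel or patch embedder; the sketch only illustrates the control flow by which text and images can be sampled in one autoregressive pass.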