DreamLLM: Synergistic Multimodal Comprehension and Creation

Abstract

This paper presents DreamLLM, a learning framework that is the first to achieve versatile Multimodal Large Language Models (MLLMs) empowered by the frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first is generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors such as CLIP, and yields a more thorough multimodal understanding. The second is that DreamLLM fosters the generation of raw, interleaved documents, modeling both text and image content along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content. Comprehensive experiments highlight DreamLLM's superior performance as a zero-shot multimodal generalist, which benefits from the enhanced learning synergy.

