MIO: A Foundation Model on Multimodal Tokens
September 26, 2024
Authors: Zekun Wang, King Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Yibo Zhang, Jiashuo Wang, Ning Shi, Siyu Li, Yizhi Li, Haoran Que, Zhaoxiang Zhang, Yuanxing Zhang, Ge Zhang, Ke Xu, Jie Fu, Wenhao Huang
cs.AI
Abstract
In this paper, we introduce MIO, a novel foundation model built on multimodal
tokens, capable of understanding and generating speech, text, images, and
videos in an end-to-end, autoregressive manner. While the emergence of large
language models (LLMs) and multimodal large language models (MM-LLMs) propels
advancements in artificial general intelligence through their versatile
capabilities, they still lack true any-to-any understanding and generation.
Recently, the release of GPT-4o has showcased the remarkable potential of
any-to-any LLMs for complex real-world tasks, enabling omnidirectional input
and output across images, speech, and text. However, it is closed-source and
does not support the generation of multimodal interleaved sequences. To address
this gap, we present MIO, which is trained on a mixture of discrete tokens
across four modalities using causal multimodal modeling. MIO undergoes a
four-stage training process: (1) alignment pre-training, (2) interleaved
pre-training, (3) speech-enhanced pre-training, and (4) comprehensive
supervised fine-tuning on diverse textual, visual, and speech tasks. Our
experimental results indicate that MIO exhibits competitive, and in some cases
superior, performance compared to previous dual-modal baselines, any-to-any
model baselines, and even modality-specific baselines. Moreover, MIO
demonstrates advanced capabilities inherent to its any-to-any feature, such as
interleaved video-text generation, chain-of-visual-thought reasoning, visual
guideline generation, instructional image editing, etc.
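
To make the training objective concrete, here is a minimal sketch of causal multimodal modeling as described in the abstract: tokens from text, image, speech, and video occupy disjoint ID ranges in one shared vocabulary, and a decoder-only transformer is trained with plain next-token prediction over the interleaved sequence. The vocabulary sizes, modality layout, and the tiny transformer below are illustrative assumptions, not MIO's actual configuration.

```python
# Minimal sketch of causal multimodal modeling over discrete tokens.
# All sizes and the modality layout are hypothetical, not MIO's real setup.
import torch
import torch.nn as nn

# Assumed layout: one shared vocabulary where text, image, and speech tokens
# occupy disjoint ID ranges (video is assumed to reuse the image codebook).
TEXT_VOCAB, IMAGE_CODES, SPEECH_CODES = 32000, 8192, 4096
VOCAB_SIZE = TEXT_VOCAB + IMAGE_CODES + SPEECH_CODES

class TinyCausalLM(nn.Module):
    """Toy decoder-only transformer over the shared multimodal vocabulary."""

    def __init__(self, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, ids):
        # Causal mask: each position attends only to its prefix.
        causal = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = self.blocks(self.embed(ids), mask=causal)
        return self.lm_head(h)

# One interleaved sequence (e.g. text tokens, then image codes shifted into
# their ID range, then more text), trained with next-token prediction.
ids = torch.randint(0, VOCAB_SIZE, (1, 128))
model = TinyCausalLM()
logits = model(ids)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB_SIZE),  # predict token t+1 from prefix
    ids[:, 1:].reshape(-1),
)
loss.backward()
```

Because every modality is reduced to discrete tokens in a single sequence, the same autoregressive loss covers understanding and generation in any direction, which is what enables the any-to-any behavior and interleaved multimodal outputs described in the abstract.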