Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
August 22, 2024
Authors: Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou
cs.AI
Abstract
We present a unified transformer, i.e., Show-o, that unifies multimodal
understanding and generation. Unlike fully autoregressive models, Show-o
unifies autoregressive and (discrete) diffusion modeling to adaptively handle
inputs and outputs of various and mixed modalities. The unified model flexibly
supports a wide range of vision-language tasks including visual
question-answering, text-to-image generation, text-guided
inpainting/extrapolation, and mixed-modality generation. Across various
benchmarks, it demonstrates comparable or superior performance to existing
individual models with an equivalent or larger number of parameters tailored
for understanding or generation. This significantly highlights its potential as
a next-generation foundation model. Code and models are released at
https://github.com/showlab/Show-o.
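The core idea in the abstract is that one transformer is trained with two objectives over a shared token space: autoregressive next-token prediction for text and discrete-diffusion-style masked-token prediction for image tokens. The sketch below illustrates that combination in a toy setting; it is not the released Show-o implementation, and the module names, vocabulary sizes, attention masking, and the fixed masking ratio are all illustrative assumptions.

```python
# Conceptual sketch (assumed, not the Show-o codebase): one transformer trained with
# (a) next-token prediction on text tokens and (b) masked-token prediction on
# discrete image tokens, standing in for the AR + discrete-diffusion objectives.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMG_VOCAB = 1000, 1024   # toy vocabulary sizes (illustrative)
VOCAB = TEXT_VOCAB + IMG_VOCAB + 1   # +1 for a [MASK] token
MASK_ID = VOCAB - 1

class UnifiedTransformer(nn.Module):
    """A single transformer over a shared text+image token vocabulary."""
    def __init__(self, d_model=256, n_layers=4, n_heads=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, ids, attn_mask=None):
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.blocks(self.tok(ids) + self.pos(pos), mask=attn_mask)
        return self.head(h)

def text_ar_loss(model, text_ids):
    """Autoregressive (next-token) loss on text, using a causal attention mask."""
    L = text_ids.size(1)
    causal = torch.triu(torch.ones(L, L, dtype=torch.bool,
                                   device=text_ids.device), diagonal=1)
    logits = model(text_ids, attn_mask=causal)
    return F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),
                           text_ids[:, 1:].reshape(-1))

def image_mask_loss(model, img_ids, mask_ratio=0.5):
    """Discrete-diffusion-style loss: randomly mask image tokens, predict them
    with full (bidirectional) attention, and score only the masked positions."""
    mask = torch.rand_like(img_ids, dtype=torch.float) < mask_ratio
    corrupted = img_ids.masked_fill(mask, MASK_ID)
    logits = model(corrupted)            # no causal mask: full attention
    return F.cross_entropy(logits[mask], img_ids[mask])

if __name__ == "__main__":
    model = UnifiedTransformer()
    text = torch.randint(0, TEXT_VOCAB, (2, 32))                        # toy text tokens
    image = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMG_VOCAB, (2, 64))  # toy image tokens
    loss = text_ar_loss(model, text) + image_mask_loss(model, image)
    loss.backward()
    print(float(loss))
```

In this toy form, text generation would decode tokens left to right under the causal mask, while image generation would iteratively fill in [MASK] tokens under full attention; the paper's actual tokenizers, attention scheme, and sampling schedule should be taken from the released code at https://github.com/showlab/Show-o.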