JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
November 12, 2024
Authors: Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, Liang Zhao, Yisong Wang, Jiaying Liu, Chong Ruan
cs.AI
Abstract
We present JanusFlow, a powerful framework that unifies image understanding
and generation in a single model. JanusFlow introduces a minimalist
architecture that integrates autoregressive language models with rectified
flow, a state-of-the-art method in generative modeling. Our key finding
demonstrates that rectified flow can be straightforwardly trained within the
large language model framework, eliminating the need for complex architectural
modifications. To further improve the performance of our unified model, we
adopt two key strategies: (i) decoupling the understanding and generation
encoders, and (ii) aligning their representations during unified training.
Extensive experiments show that JanusFlow achieves comparable or superior
performance to specialized models in their respective domains, while
significantly outperforming existing unified approaches across standard
benchmarks. This work represents a step toward more efficient and versatile
vision-language models.
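To make the rectified-flow component concrete: rectified flow trains a velocity field to map noise to data along straight-line paths, using a simple regression loss. The sketch below is a minimal, generic illustration of that objective, not JanusFlow's actual implementation; in the paper the velocity predictor is the large language model backbone itself, and the function and shapes here are hypothetical.

```python
import numpy as np

def rectified_flow_loss(velocity_fn, x1, rng):
    """One rectified-flow training objective on a batch of clean samples x1.

    velocity_fn(xt, t) predicts the velocity at interpolated point xt, time t.
    """
    x0 = rng.standard_normal(x1.shape)         # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))     # random time in [0, 1] per sample
    xt = (1.0 - t) * x0 + t * x1               # linear interpolation between noise and data
    target = x1 - x0                           # constant velocity along the straight path
    pred = velocity_fn(xt, t)
    return float(np.mean((pred - target) ** 2))  # simple MSE regression loss

rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 4))               # toy "data" batch
# Trivial stand-in model that predicts zero velocity everywhere.
loss = rectified_flow_loss(lambda xt, t: np.zeros_like(xt), x1, rng)
```

Because the target is a plain regression signal, this loss can be computed alongside a language model's next-token loss without architectural changes, which is the key simplification the abstract highlights.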