JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

Abstract

We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding is that rectified flow can be trained straightforwardly within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.
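
As background on the rectified flow component mentioned above, such models are commonly trained by regressing a velocity field along straight-line paths between noise and data. The block below gives the standard textbook form of that objective, not necessarily the exact parameterization or time-sampling scheme used in JanusFlow. The symbols $x$ (data sample), $z_0$ (Gaussian noise), $z_t$ (interpolated point), and $v_\theta$ (learned velocity network) are notation assumed here for illustration.

```latex
z_t = t\,x + (1 - t)\,z_0, \qquad z_0 \sim \mathcal{N}(0, I), \quad t \in [0, 1]

\mathcal{L}(\theta) = \mathbb{E}_{x,\, z_0,\, t}\left[ \left\| v_\theta(z_t, t) - (x - z_0) \right\|^2 \right]
```

Under this formulation, the regression target $x - z_0$ is the time derivative of $z_t$, so a sample can be generated by integrating $\mathrm{d}z_t/\mathrm{d}t = v_\theta(z_t, t)$ from $t = 0$ (noise) to $t = 1$ (data), for example with a few Euler steps.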
