OmniGen: Unified Image Generation

Abstract

In this work, we introduce OmniGen, a new diffusion model for unified image generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires additional modules such as ControlNet or IP-Adapter to process diverse control conditions. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports other downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. Additionally, OmniGen can handle classical computer vision tasks, such as edge detection and human pose recognition, by transforming them into image generation tasks. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional text encoders. Moreover, it is more user-friendly than existing diffusion models: complex tasks can be accomplished through instructions alone, without extra preprocessing steps (e.g., human pose estimation), which significantly simplifies the image generation workflow. 3) Knowledge Transfer: By learning in a unified format, OmniGen effectively transfers knowledge across different tasks, handles unseen tasks and domains, and exhibits novel capabilities. We also explore the model's reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and several issues remain unresolved. We will open-source the related resources at https://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.
