CosmicMan: A Text-to-Image Foundation Model for Humans

Abstract

We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Current general-purpose foundation models are stuck in a dilemma of inferior quality and text-image misalignment for humans. CosmicMan, by contrast, generates photo-realistic human images with meticulous appearance, reasonable structure, and precise text-image alignment under detailed, dense descriptions. At the heart of CosmicMan's success are new reflections and perspectives on data and models: (1) We found that data quality and a scalable data production flow are essential to the final results of trained models. Hence, we propose a new data production paradigm, Annotate Anyone, which serves as a perpetual data flywheel that produces high-quality data with accurate yet cost-effective annotations over time. Based on this, we constructed a large-scale dataset, CosmicMan-HQ 1.0, containing 6 million high-quality real-world human images at a mean resolution of 1488x1255, paired with precise text annotations derived from 115 million attributes at diverse granularities. (2) We argue that a text-to-image foundation model specialized for humans must be pragmatic: easy to integrate into downstream tasks while effective at producing high-quality human images. Hence, we propose to model the relationship between dense text descriptions and image pixels in a decomposed manner, and present the Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly decomposes the cross-attention features of existing text-to-image diffusion models and enforces attention refocusing without adding extra modules. Through Daring, we show that explicitly discretizing the continuous text space into several basic groups that align with human body structure is the key to resolving the misalignment problem with ease.
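The abstract describes Daring only at a high level, so the following is a rough sketch of the two ideas it names, not the authors' implementation. It assumes a hypothetical keyword table (`BODY_GROUPS`) for sorting attribute phrases into body-aligned groups, and a simple illustrative penalty (`outside_mass`) that measures how much of a group's cross-attention falls outside the matching body region. The group names, the keyword table, and the form of the penalty are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical keyword-to-group table. The abstract does not list the
# actual groups used by Daring; these are placeholders for illustration.
BODY_GROUPS = {
    "hair": "head",
    "hat": "head",
    "shirt": "upper_body",
    "jacket": "upper_body",
    "jeans": "lower_body",
    "skirt": "lower_body",
    "sneakers": "feet",
    "boots": "feet",
}


def group_phrases(phrases):
    """Assign each attribute phrase to a body group by keyword match.

    Phrases with no known keyword go to the "other" group.
    """
    groups = defaultdict(list)
    for phrase in phrases:
        words = phrase.lower().split()
        group = next((BODY_GROUPS[w] for w in words if w in BODY_GROUPS), "other")
        groups[group].append(phrase)
    return dict(groups)


def outside_mass(attn, mask):
    """Fraction of attention mass that falls outside a region mask.

    attn: 2D list of non-negative attention weights for one text group.
    mask: 2D list of 0/1 values marking the matching body region.
    Returns 0.0 when all attention lies inside the region and 1.0 when
    none does. An all-zero attention map is treated as having no penalty.
    """
    total = sum(sum(row) for row in attn)
    if total == 0:
        return 0.0
    inside = sum(
        a * m
        for attn_row, mask_row in zip(attn, mask)
        for a, m in zip(attn_row, mask_row)
    )
    return 1.0 - inside / total


if __name__ == "__main__":
    description = ["long black hair", "red jacket", "blue jeans", "white sneakers"]
    print(group_phrases(description))

    # Toy 2x2 attention map for one group; the right column is its region.
    attn = [[0.1, 0.4], [0.1, 0.4]]
    mask = [[0, 1], [0, 1]]
    print(outside_mass(attn, mask))
```

In a training setup, a penalty of this kind would presumably be computed per group and added to the diffusion loss, which is consistent with the abstract's statement that refocusing is enforced without extra modules. How the paper actually defines the groups and the loss is not stated in the abstract.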
