CosmicMan: A Text-to-Image Foundation Model for Humans
April 1, 2024
Authors: Shikai Li, Jianglin Fu, Kaiyuan Liu, Wentao Wang, Kwan-Yee Lin, Wayne Wu
cs.AI
Abstract
We present CosmicMan, a text-to-image foundation model specialized for
generating high-fidelity human images. Unlike current general-purpose
foundation models that are stuck in the dilemma of inferior quality and
text-image misalignment for humans, CosmicMan enables generating
photo-realistic human images with meticulous appearance, reasonable structure,
and precise text-image alignment with detailed dense descriptions. At the heart
of CosmicMan's success are the new reflections and perspectives on data and
models: (1) We found that data quality and a scalable data production flow are
essential for the final results from trained models. Hence, we propose a new
data production paradigm, Annotate Anyone, which serves as a perpetual data
flywheel to produce high-quality data with accurate yet cost-effective
annotations over time. Based on this, we constructed a large-scale dataset,
CosmicMan-HQ 1.0, with 6 million high-quality real-world human images at a mean
resolution of 1488x1255, paired with precise text annotations derived
from 115 million attributes across diverse granularities. (2) We argue that a
text-to-image foundation model specialized for humans must be pragmatic -- easy
to integrate into downstream tasks while effective in producing
high-quality human images. Hence, we propose to model the relationship between
dense text descriptions and image pixels in a decomposed manner, and present
the Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly
decomposes the cross-attention features in existing text-to-image diffusion
models and enforces attention refocusing without adding extra modules. Through
Daring, we show that explicitly discretizing continuous text space into several
basic groups that align with human body structure is the key to tackling the
misalignment problem with ease.
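The abstract describes Daring only at a conceptual level. As a rough illustration of the idea of discretizing a dense caption into body-structure-aligned groups and refocusing cross-attention onto the corresponding image regions, here is a minimal PyTorch-style sketch. The group list, function names, and the exact loss form are assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the Daring idea: split a dense caption into groups
# aligned with human body structure, then refocus cross-attention so each
# group's tokens attend within that group's region. Names and loss form are
# assumptions, not the authors' code.
import torch

# Assumed discretization of the continuous text space into body-aligned groups.
GROUPS = ["whole body", "head", "upper body", "lower body", "outfit"]


def split_caption_into_groups(caption_phrases, tokenizer):
    """Map each group's phrase to the token positions it occupies in the prompt.

    `caption_phrases` is e.g. {"head": "short curly hair, round glasses", ...};
    in practice such phrases would be composed from fine-grained attribute
    annotations like those in CosmicMan-HQ (assumption).
    """
    token_ids, offset = {}, 0
    for group in GROUPS:
        phrase = caption_phrases.get(group, "")
        n = len(tokenizer.tokenize(phrase)) if phrase else 0
        token_ids[group] = list(range(offset, offset + n))
        offset += n
    return token_ids


def refocusing_loss(cross_attn, group_tokens, group_masks):
    """Penalize attention mass that falls outside each group's body region.

    cross_attn:   (heads, H*W, num_tokens) cross-attention maps from one layer.
    group_tokens: output of split_caption_into_groups.
    group_masks:  per-group binary masks of shape (H*W,), e.g. from an
                  off-the-shelf human parser (assumption).
    """
    loss = cross_attn.new_zeros(())
    counted = 0
    for group, tokens in group_tokens.items():
        if not tokens or group not in group_masks:
            continue
        attn = cross_attn[:, :, tokens].mean(dim=(0, 2))   # (H*W,) averaged map
        mask = group_masks[group].float()
        inside = (attn * mask).sum()
        total = attn.sum().clamp_min(1e-6)
        loss = loss + (1.0 - inside / total)               # keep mass in-region
        counted += 1
    return loss / max(counted, 1)
```

Under these assumptions, such a term would be added to the usual denoising objective during training only, which is consistent with the abstract's claim that no extra modules are introduced at inference.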