CosmicMan: A Text-to-Image Foundation Model for Humans
April 1, 2024
Authors: Shikai Li, Jianglin Fu, Kaiyuan Liu, Wentao Wang, Kwan-Yee Lin, Wayne Wu
cs.AI
Abstract
We present CosmicMan, a text-to-image foundation model specialized for
generating high-fidelity human images. Unlike current general-purpose
foundation models that are stuck in the dilemma of inferior quality and
text-image misalignment for humans, CosmicMan enables generating
photo-realistic human images with meticulous appearance, reasonable structure,
and precise text-image alignment with detailed dense descriptions. At the heart
of CosmicMan's success are the new reflections and perspectives on data and
models: (1) We found that data quality and a scalable data production flow are
essential to the final results of trained models. Hence, we propose a new
data production paradigm, Annotate Anyone, which serves as a perpetual data
flywheel to produce high-quality data with accurate yet cost-effective
annotations over time. Based on this, we constructed a large-scale dataset,
CosmicMan-HQ 1.0, with 6 million high-quality real-world human images at a mean
resolution of 1488x1255, each paired with precise text annotations derived
from 115 million attributes of diverse granularities. (2) We argue that a
text-to-image foundation model specialized for humans must be pragmatic -- easy
to integrate into downstream tasks while effective in producing
high-quality human images. Hence, we propose to model the relationship between
dense text descriptions and image pixels in a decomposed manner, and present
the Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly
decomposes the cross-attention features in an existing text-to-image diffusion
model and enforces attention refocusing without adding extra modules. Through
Daring, we show that explicitly discretizing continuous text space into several
basic groups that align with human body structure is the key to tackling the
misalignment problem with ease.
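
The abstract does not spell out Daring's exact formulation. As a rough illustration of the idea it describes -- decomposing cross-attention over dense text into body-structure groups and refocusing each group's attention -- the following is a minimal, hypothetical PyTorch sketch. The function names, the mean pooling over token groups, and the MSE form of the refocusing term are all assumptions for illustration, not the paper's actual method.

```python
import torch

def group_attention(attn, token_groups):
    """Pool cross-attention maps over the text tokens of each body-structure group.

    attn:         (B, N, T) attention from N spatial positions to T text tokens
    token_groups: dict name -> list of token indices belonging to that group
    returns:      dict name -> (B, N) group-level attention map
    """
    return {name: attn[..., idx].mean(dim=-1) for name, idx in token_groups.items()}

def refocusing_loss(attn, token_groups, group_masks):
    """Illustrative refocusing objective: each group's pooled attention is pushed
    toward a normalized spatial mask of the body region that group describes.

    group_masks: dict name -> (B, N) binary or soft region mask for that group
    """
    grouped = group_attention(attn, token_groups)
    loss = 0.0
    for name, a in grouped.items():
        m = group_masks[name]
        m = m / (m.sum(dim=-1, keepdim=True) + 1e-6)   # normalize mask over space
        a = a / (a.sum(dim=-1, keepdim=True) + 1e-6)   # normalize attention over space
        loss = loss + torch.mean((a - m) ** 2)
    return loss / max(len(grouped), 1)

# Usage sketch: group a dense caption's tokens by body structure and add the
# refocusing term to the usual diffusion denoising loss (indices are made up).
B, N, T = 2, 64 * 64, 77
attn = torch.rand(B, N, T).softmax(dim=-1)
token_groups = {"whole_body": [1, 2, 3], "head": [10, 11], "upper_body": [20, 21, 22]}
group_masks = {k: torch.rand(B, N) for k in token_groups}
print(refocusing_loss(attn, token_groups, group_masks))
```

In practice such a term would be added to the standard denoising objective during fine-tuning, which matches the abstract's claim of enforcing attention refocusing without extra modules; the grouping itself is what discretizes the continuous text space along human body structure.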