PoseDreamer: A Scalable, Photorealistic Human Data Generation Pipeline Based on Diffusion Models

Abstract

Acquiring labeled datasets for 3D human mesh estimation is challenging because of depth ambiguity and the inherent difficulty of annotating 3D geometry from monocular images. Existing datasets are either real, with manually annotated 3D geometry but limited scale, or synthetic, rendered with 3D engines that provide precise labels but suffer from limited photorealism, low diversity, and high production costs. In this work, we explore a third path: generated data. We introduce PoseDreamer, a novel pipeline that leverages diffusion models to generate large-scale synthetic datasets with 3D mesh annotations. Our approach combines controllable image generation with Direct Preference Optimization (DPO) for control alignment, curriculum-based hard sample mining, and multi-stage quality filtering. Together, these components naturally maintain the correspondence between 3D labels and generated images while prioritizing challenging samples to maximize dataset utility. Using PoseDreamer, we generate more than 500,000 high-quality synthetic samples, achieving a 76% improvement in image-quality metrics over rendering-based datasets. Models trained on PoseDreamer perform comparably to or better than those trained on real-world and traditional synthetic datasets. In addition, combining PoseDreamer with synthetic datasets yields better performance than combining real-world and synthetic datasets, demonstrating the complementary nature of our dataset. We will release the full dataset and generation code.