
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

July 24, 2024
Authors: Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan, Kai Chen, Tianfan Xue, Bo Dai, Dahua Lin
cs.AI

Abstract

Human image animation involves generating videos from a character photo, allowing user control and unlocking potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motions in videos, leading to limited control and unstable video generation. To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines crafted real-world and synthetic data. For the real-world data, we compile a vast collection of copyright-free videos from the internet. Through a carefully designed rule-based filtering strategy, we ensure the inclusion of high-quality videos, resulting in a collection of 20K human-centric videos in 1080P resolution. Human and camera motion annotation is accomplished using a 2D pose estimator and a SLAM-based method. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing available 3D assets. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotation, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, which considers both human and camera motions as conditions. Through extensive experimentation, we demonstrate that training such a simple baseline on our HumanVid achieves state-of-the-art performance in controlling both human pose and camera motions, setting a new benchmark. Code and data will be publicly available at https://github.com/zhenzhiwang/HumanVid/.
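To make the "rule-based filtering strategy" concrete, below is a minimal sketch of what a conjunctive metadata filter over crawled clips might look like. The specific thresholds and field names (resolution floor, duration window, `num_people`, `person_area_ratio`) are hypothetical placeholders; the abstract does not spell out the paper's actual rules.

```python
# Minimal sketch: rule-based filtering of crawled video clips by precomputed
# metadata. All thresholds and fields are illustrative assumptions, not the
# rules used in HumanVid.
from dataclasses import dataclass

@dataclass
class VideoMeta:
    width: int
    height: int
    duration_s: float
    num_people: int           # e.g. max per-frame count from a 2D pose estimator
    person_area_ratio: float  # mean person-bbox area / frame area

def keep(v: VideoMeta) -> bool:
    """Conjunctive quality rules; a clip survives only if every rule passes."""
    return (
        min(v.width, v.height) >= 1080      # 1080P or better
        and 4.0 <= v.duration_s <= 60.0     # avoid too-short / too-long clips
        and v.num_people == 1               # human-centric, single subject
        and v.person_area_ratio >= 0.1      # subject is prominent in frame
    )

clips = [
    VideoMeta(1920, 1080, 12.0, 1, 0.25),
    VideoMeta(1280, 720, 30.0, 1, 0.30),   # rejected: below 1080P
]
print([keep(c) for c in clips])  # [True, False]
```

A conjunctive design like this trades recall for precision, which matches the abstract's emphasis on ensuring only high-quality videos enter the dataset.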
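The "rule-based camera trajectory generation" for the synthetic pipeline can likewise be pictured as sampling cinematographic primitives (pan, truck, dolly) with randomized magnitudes and a smooth timing curve, then emitting per-frame extrinsics. The primitive set, parameter ranges, and easing below are illustrative assumptions, not the paper's exact rules.

```python
# Minimal sketch: sampling a per-shot camera trajectory from simple motion
# primitives, producing exact 4x4 world-to-camera extrinsics per frame.
# Primitives, ranges, and easing are assumptions for illustration only.
import numpy as np

def rotation_y(theta: float) -> np.ndarray:
    """Rotation matrix about the camera's vertical (yaw) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def make_trajectory(num_frames: int = 120, rng=None):
    """Return (primitive_name, list of 4x4 extrinsics) for one synthetic shot."""
    rng = rng or np.random.default_rng()
    primitive = rng.choice(["static", "pan", "truck", "dolly"])
    max_yaw = np.deg2rad(rng.uniform(5, 30))  # total pan angle
    max_shift = rng.uniform(0.2, 1.0)         # total translation (meters)
    poses = []
    for t in np.linspace(0.0, 1.0, num_frames):
        s = 3 * t**2 - 2 * t**3               # ease-in/ease-out timing curve
        R, trans = np.eye(3), np.zeros(3)
        if primitive == "pan":
            R = rotation_y(max_yaw * s)
        elif primitive == "truck":
            trans[0] = max_shift * s          # sideways slide
        elif primitive == "dolly":
            trans[2] = max_shift * s          # forward/backward push
        pose = np.eye(4)
        pose[:3, :3] = R
        pose[:3, 3] = trans
        poses.append(pose)
    return primitive, poses

primitive, poses = make_trajectory()
print(primitive, len(poses), poses[-1][:3, 3])
```

Because the trajectory is generated rather than estimated, every frame's extrinsics are known exactly, which is the property the abstract highlights: diverse, precise camera motion annotation that SLAM on real-world footage can only approximate.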
