HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

July 24, 2024
作者: Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan, Kai Chen, Tianfan Xue, Bo Dai, Dahua Lin
cs.AI

Abstract

Human image animation involves generating videos from a character photo, allowing user control and unlocking potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motions in videos, leading to limited control and unstable video generation. To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines carefully crafted real-world and synthetic data. For the real-world data, we compile a vast collection of copyright-free videos from the internet. Through a carefully designed rule-based filtering strategy, we ensure the inclusion of high-quality videos, resulting in a collection of 20K human-centric videos at 1080P resolution. Human and camera motions are annotated using a 2D pose estimator and a SLAM-based method, respectively. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing available 3D assets. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotations, which are rarely found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that considers both human and camera motions as conditions. Through extensive experiments, we demonstrate that training this simple baseline on our HumanVid achieves state-of-the-art performance in controlling both human pose and camera motions, setting a new benchmark. Code and data will be publicly available at https://github.com/zhenzhiwang/HumanVid/.
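
The abstract does not specify how the rule-based camera trajectory generation works, so the following is only a minimal sketch of the idea: trajectories composed from a few primitive camera moves (pan, truck, zoom) with smooth easing, yielding exact per-frame extrinsics for the rendered clips. All function names and parameters here (ease_in_out, make_trajectory, max_angle_deg, and so on) are hypothetical illustrations, not the paper's API.

```python
import numpy as np

def ease_in_out(t: np.ndarray) -> np.ndarray:
    """Smoothstep easing so camera moves start and stop gently."""
    return t * t * (3.0 - 2.0 * t)

def make_trajectory(num_frames: int, rule: str = "pan_right",
                    max_angle_deg: float = 15.0, max_offset: float = 0.3):
    """Per-frame camera parameters (yaw plus translation) for one
    primitive move; returns three arrays of shape (num_frames,)."""
    t = ease_in_out(np.linspace(0.0, 1.0, num_frames))
    yaw = np.zeros(num_frames)   # rotation about the vertical axis
    tx = np.zeros(num_frames)    # sideways translation
    tz = np.zeros(num_frames)    # translation along the optical axis
    if rule == "pan_right":
        yaw = np.deg2rad(max_angle_deg) * t
    elif rule == "truck_left":
        tx = -max_offset * t
    elif rule == "zoom_in":
        tz = max_offset * t
    # rule == "static" keeps the all-zero defaults
    return yaw, tx, tz

def to_extrinsics(yaw, tx, tz) -> np.ndarray:
    """Turn the 1-D parameter tracks into 4x4 camera pose matrices."""
    mats = []
    for y, x, z in zip(yaw, tx, tz):
        T = np.eye(4)
        T[:3, :3] = np.array([[np.cos(y), 0.0, np.sin(y)],
                              [0.0, 1.0, 0.0],
                              [-np.sin(y), 0.0, np.cos(y)]])
        T[:3, 3] = [x, 0.0, z]
        mats.append(T)
    return np.stack(mats)  # shape (num_frames, 4, 4)

# Example: a 120-frame pan-right clip with exact camera annotation.
extrinsics = to_extrinsics(*make_trajectory(120, rule="pan_right"))
```

Because such extrinsics are generated rather than estimated, every synthetic clip comes with precise camera motion annotation by construction, which is exactly the property the abstract notes is hard to obtain from real-world footage.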
