OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
February 3, 2025
Authors: Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang
cs.AI
Abstract
End-to-end human animation, such as audio-driven talking human generation,
has made notable progress in recent years. However, existing methods still
struggle to scale up into large, general-purpose video generation models,
limiting their potential in real-world applications. In this paper, we propose
OmniHuman, a Diffusion Transformer-based framework that scales up data by
mixing motion-related conditions into the training phase. To this end, we
introduce two training principles for these mixed conditions, along with the
corresponding model architecture and inference strategy. These designs enable
OmniHuman to fully leverage data-driven motion generation, ultimately achieving
highly realistic human video generation. More importantly, OmniHuman supports
various portrait contents (face close-up, portrait, half-body, full-body),
supports both talking and singing, handles human-object interactions and
challenging body poses, and accommodates different image styles. Compared to
existing end-to-end audio-driven methods, OmniHuman not only produces more
realistic videos, but also offers greater flexibility in its inputs. It also
supports multiple driving modalities (audio-driven, video-driven and combined
driving signals). Video samples are provided on the project page
(https://omnihuman-lab.github.io).
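
The abstract describes the mixed-condition training only at a high level. As a rough, assumption-laden illustration, the PyTorch sketch below shows one plausible way such a setup could look: a single Transformer denoiser that accepts optional text, audio, and pose features, with conditions randomly dropped per training step so the one model learns from any mix of driving signals. The class and function names, the condition drop ratios, and the flow-matching-style objective are all hypothetical and not taken from the paper.

```python
# Hypothetical sketch of mixed-condition training for a diffusion-style human
# animation model. Names, shapes, and ratios are illustrative assumptions only.
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed per-condition drop probabilities; stronger conditions (e.g. pose) are
# dropped more often so weaker ones (e.g. audio, text) still see enough data.
DROP_PROBS = {"text": 0.1, "audio": 0.5, "pose": 0.7}


class MixedConditionDenoiser(nn.Module):
    """Toy Transformer denoiser with optional text/audio/pose conditioning."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.video_proj = nn.Linear(dim, dim)
        self.cond_proj = nn.ModuleDict(
            {name: nn.Linear(dim, dim) for name in DROP_PROBS}
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_video, conditions):
        h = self.video_proj(noisy_video)
        # Absent conditions are simply skipped, so one network handles any mix.
        for name, feat in conditions.items():
            if feat is not None:
                h = h + self.cond_proj[name](feat)
        return self.out(self.backbone(h))


def sample_condition_mix(conditions):
    """Randomly drop each condition for this step, according to DROP_PROBS."""
    return {
        name: (feat if random.random() > DROP_PROBS[name] else None)
        for name, feat in conditions.items()
    }


if __name__ == "__main__":
    model = MixedConditionDenoiser()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Dummy latent video and per-frame condition features: (batch, frames, dim).
    video = torch.randn(2, 16, 256)
    conds = {name: torch.randn(2, 16, 256) for name in DROP_PROBS}

    # One flow-matching-style step: interpolate toward noise, regress the velocity.
    noise = torch.randn_like(video)
    t = torch.rand(2, 1, 1)
    noisy = (1 - t) * video + t * noise
    pred = model(noisy, sample_condition_mix(conds))
    loss = F.mse_loss(pred, noise - video)

    loss.backward()
    opt.step()
    opt.zero_grad()
    print(f"toy loss: {loss.item():.4f}")
```

The key design point the sketch tries to convey is that all driving signals share one backbone and are dropped independently at training time, which is one way a model could reuse data across conditions instead of training a separate network per modality.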