MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation
May 15, 2025
Authors: Yanbo Ding, Xirui Hu, Zhizhi Guo, Yali Wang
cs.AI
Abstract
Human image animation has gained increasing attention and developed rapidly
due to its broad applications in digital humans. However, existing methods rely
largely on 2D-rendered pose images for motion guidance, which limits
generalization and discards essential 3D information for open-world animation.
To tackle this problem, we propose MTVCrafter (Motion Tokenization Video
Crafter), the first framework that directly models raw 3D motion sequences
(i.e., 4D motion) for human image animation. Specifically, we introduce 4DMoT
(4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens.
Compared to 2D-rendered pose images, 4D motion tokens offer more robust
spatio-temporal cues and avoid strict pixel-level alignment between pose image
and character, enabling more flexible and disentangled control. Next, we
introduce MV-DiT (Motion-aware Video DiT). With a unique motion attention
mechanism and 4D positional encodings, MV-DiT effectively leverages motion
tokens as compact yet expressive 4D context for human image animation in the
complex 3D world. MTVCrafter thus marks a significant step forward in this field
and opens a new direction for pose-guided human video generation. Experiments show that our
MTVCrafter achieves state-of-the-art results with an FID-VID of 6.98,
surpassing the second-best method by 65%. Powered by robust motion tokens, MTVCrafter
also generalizes well to diverse open-world characters (single/multiple,
full/half-body) across various styles and scenarios. Our video demos and code
are available at https://github.com/DINGYANB/MTVCrafter.
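
To make the idea of quantizing a 3D motion sequence into discrete tokens concrete, here is a minimal sketch of nearest-neighbor codebook lookup. It is not the authors' 4DMoT: the function name `quantize_motion`, the toy codebook, and the choice to quantize raw joint coordinates directly (rather than learned encoder features) are all assumptions made for illustration.

```python
import math


def quantize_motion(motion, codebook):
    """Map every joint position to the index of its nearest codebook vector.

    motion:   list of frames, each a list of (x, y, z) joint positions
    codebook: list of (x, y, z) code vectors (hypothetical, hand-picked here)
    Returns a frames x joints grid of integer token ids.
    """
    tokens = []
    for frame in motion:
        frame_tokens = []
        for joint in frame:
            # Pick the code vector with the smallest Euclidean distance.
            nearest = min(
                range(len(codebook)),
                key=lambda k: math.dist(joint, codebook[k]),
            )
            frame_tokens.append(nearest)
        tokens.append(frame_tokens)
    return tokens


# Toy data: 2 frames, 2 joints per frame, 4 code vectors (illustrative only).
codebook = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
motion = [
    [(0.1, 0.0, 0.0), (0.9, 0.1, 0.0)],
    [(0.0, 0.8, 0.1), (0.1, 0.1, 0.9)],
]
tokens = quantize_motion(motion, codebook)
print(tokens)
```

The output keeps one token per (frame, joint) position, so each token retains a time index and a joint index. A real tokenizer would typically learn both the codebook and an encoder from data rather than use fixed vectors.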