MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation

May 15, 2025
Authors: Yanbo Ding, Xirui Hu, Zhizhi Guo, Yali Wang
cs.AI

Abstract

Human image animation has gained increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards 3D information that is essential for open-world animation. To tackle this problem, we propose MTVCrafter (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for human image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens offer more robust spatio-temporal cues and avoid strict pixel-level alignment between the pose image and the character, enabling more flexible and disentangled control. We then introduce MV-DiT (Motion-aware Video DiT). By designing a unique motion attention mechanism with 4D positional encodings, MV-DiT can effectively leverage motion tokens as compact yet expressive 4D context for human image animation in the complex 3D world. Hence, it marks a significant step forward in this field and opens a new direction for pose-guided human video generation. Experiments show that MTVCrafter achieves state-of-the-art results with an FID-VID of 6.98, surpassing the second-best method by 65%. Powered by robust motion tokens, MTVCrafter also generalizes well to diverse open-world characters (single/multiple, full/half-body) across various styles and scenarios. Video demos and code are available at https://github.com/DINGYANB/MTVCrafter.
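
The abstract does not spell out how 4DMoT quantizes motion, so the following is only a minimal sketch of generic nearest-neighbor vector quantization, the basic idea behind turning continuous 3D joint sequences into discrete tokens. It is not the authors' implementation. The function name `quantize_motion`, the `(T, J, 3)` layout, the random codebook, and the choice to quantize raw joint coordinates (rather than learned encoder features) are all assumptions made for illustration.

```python
import numpy as np

def quantize_motion(motion, codebook):
    """Map each joint position to the index of its nearest codebook vector.

    motion:   (T, J, 3) array of 3D joint positions over T frames (assumed layout).
    codebook: (K, 3) array of code vectors (hypothetical; random in this sketch,
              whereas a real tokenizer would learn them).
    Returns token indices of shape (T, J) and the quantized motion (T, J, 3).
    """
    flat = motion.reshape(-1, motion.shape[-1])  # (T*J, 3)
    # Squared Euclidean distance from every point to every code vector.
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=1)  # nearest code index per point
    quantized = codebook[tokens].reshape(motion.shape)
    return tokens.reshape(motion.shape[:-1]), quantized

# Toy data: 16 frames, 24 joints, a 64-entry codebook (all sizes are illustrative).
rng = np.random.default_rng(0)
motion = rng.normal(size=(16, 24, 3))
codebook = rng.normal(size=(64, 3))
tokens, quantized = quantize_motion(motion, codebook)
```

Here `tokens` holds one discrete index per joint per frame, so the token grid keeps the temporal and per-joint structure of the input instead of flattening it into a rendered 2D pose image.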
