MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
December 11, 2025
Authors: Kehong Gong, Zhengyu Wen, Weixia He, Mingxi Xu, Qi Wang, Ning Zhang, Zhengyu Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang
cs.AI
Abstract
Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate the Truebones Zoo dataset with 1,038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/
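To make the factorization concrete, the sketch below illustrates the second stage implied by the abstract: converting predicted 3D joint trajectories into per-bone rotations for an arbitrary rigged skeleton. It is only a minimal stand-in, not the paper's constraint-aware IK; it computes the shortest-arc rotation aligning each rest-pose bone with its predicted direction per frame and ignores joint limits and bone twist. All function and variable names (`joints_to_bone_rotations`, `rest_joints`, `parents`, etc.) are illustrative assumptions, not part of any released code.

```python
# Minimal sketch (assumption, not the paper's method): recover per-bone
# rotations of a rigged asset from stage-1 3D joint trajectories by aligning
# rest-pose bone directions with predicted bone directions, frame by frame.
import numpy as np


def rotation_between(a, b, eps=1e-8):
    """Shortest-arc rotation matrix taking direction a onto direction b."""
    a = a / (np.linalg.norm(a) + eps)
    b = b / (np.linalg.norm(b) + eps)
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if c < -1.0 + eps:  # nearly opposite: rotate 180 deg about any orthogonal axis
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < eps:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis = axis / np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)


def joints_to_bone_rotations(pred_joints, rest_joints, parents):
    """
    pred_joints : (T, J, 3) predicted joint trajectories (stage-1 output).
    rest_joints : (J, 3) rest-pose joint positions of the rigged asset.
    parents     : length-J list; parents[j] is the parent index, -1 for the root.
    Returns (T, J, 3, 3): world-space rotation of the bone ending at joint j
    (identity for the root), i.e. the rotation taking the rest-pose bone
    offset onto the observed offset at frame t.
    """
    T, J, _ = pred_joints.shape
    rotations = np.tile(np.eye(3), (T, J, 1, 1))
    for t in range(T):
        for j in range(J):
            p = parents[j]
            if p < 0:
                continue  # root carries global translation, not a bone rotation
            rest_dir = rest_joints[j] - rest_joints[p]
            obs_dir = pred_joints[t, j] - pred_joints[t, p]
            rotations[t, j] = rotation_between(rest_dir, obs_dir)
    return rotations


if __name__ == "__main__":
    # Toy 3-joint chain (root -> elbow -> hand) bending 90 degrees in frame 1.
    parents = [-1, 0, 1]
    rest = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0]])
    pred = np.array([
        [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0]],  # frame 0: rest
        [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]],  # frame 1: bent hand
    ])
    R = joints_to_bone_rotations(pred, rest, parents)
    print(R.shape)  # (2, 3, 3, 3)
```

In this reading, the per-frame rotations (plus root translation) would then be serialized to a rotation-based format such as BVH so the animation directly drives the prompted asset; the paper's constraint-aware IK additionally enforces rig-specific constraints that this sketch omits.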