MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons
April 30, 2026
Authors: Kehong Gong, Zhengyu Wen, Dao Thien Phong, Mingxi Xu, Weixia He, Qi Wang, Ning Zhang, Zhengyu Li, Guanli Hou, Dongze Lian, Xiaoyu He, Mingyuan Zhang, Hanwang Zhang
cs.AI
Abstract
Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint positions and an analytical inverse-kinematics (IK) stage recovers joint rotations. While effective, this design is inherently limited, since joint positions do not fully determine rotations and leave degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective. In this work, we present the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. We observe that the ambiguity in pose-to-rotation mapping arises from missing coordinate system information: the same joint positions can correspond to different rotations under different rest poses and local axis conventions. To resolve this, we introduce a reference pose-rotation pair from the target asset, which, together with the rest pose, not only anchors the mapping but also defines the underlying rotation coordinate system. This formulation turns rotation prediction into a well-constrained conditional problem and enables effective learning. In addition, our model predicts joint positions directly from video without relying on mesh intermediates, improving both robustness and efficiency. Both stages share a skeleton-aware Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module for joint-level local reasoning and global coordination. Experiments on Truebones Zoo and Objaverse show that our method reduces rotation error from ~17 degrees to ~10 degrees, and to 6.54 degrees on unseen skeletons, while achieving ~20x faster inference than mesh-based pipelines. Project page: https://animotionlab.github.io/MoCapAnythingV2/
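The abstract gives no implementation details for the Pose-to-Rotation stage, but the conditional formulation it describes can be sketched as follows. This is only an illustrative assumption in PyTorch: the module name `PoseToRotationHead`, the feature sizes, and the 6D rotation output are hypothetical, not the authors' code. The point it illustrates is that predicted joint positions are decoded into rotations only after being fused with the asset's rest pose and a reference pose-rotation pair, which fixes the local rotation coordinate system.

```python
# Minimal sketch (assumptions, not the authors' code): a Pose-to-Rotation head that
# conditions on the target asset's rest pose and a reference pose-rotation pair, so
# the same joint positions resolve to a unique rotation in the asset's own frame.
import torch
import torch.nn as nn

class PoseToRotationHead(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        # Per-joint encoders for predicted positions, rest pose, and the reference pair.
        self.pos_enc = nn.Linear(3, d_model)
        self.rest_enc = nn.Linear(3, d_model)
        self.ref_enc = nn.Linear(3 + 6, d_model)   # reference positions + 6D rotations
        self.fuse = nn.Sequential(
            nn.Linear(3 * d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.rot_out = nn.Linear(d_model, 6)       # per-joint 6D rotation representation

    def forward(self, pred_pos, rest_pos, ref_pos, ref_rot6d):
        # pred_pos, rest_pos, ref_pos: (B, J, 3); ref_rot6d: (B, J, 6)
        tokens = torch.cat(
            [
                self.pos_enc(pred_pos),
                self.rest_enc(rest_pos),
                self.ref_enc(torch.cat([ref_pos, ref_rot6d], dim=-1)),
            ],
            dim=-1,
        )
        feat = self.fuse(tokens)                   # (B, J, d_model) conditioned joint tokens
        return self.rot_out(feat)                  # convert 6D output to matrices downstream
```

Conditioning each joint token on the reference pair is what turns the otherwise ambiguous position-to-rotation mapping into a well-constrained regression target, as the abstract argues.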
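Likewise, the Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module is only named in the abstract. Below is a minimal sketch of one plausible reading, with a graph-masked local branch and an unrestricted global branch; the fusion scheme, layer sizes, and the use of `nn.MultiheadAttention` are assumptions for illustration rather than the published architecture.

```python
# Hedged sketch of a skeleton-aware global-local attention block in the spirit of the
# GL-GMHA module named in the abstract (structure and hyperparameters are assumptions).
import torch
import torch.nn as nn

class GlobalLocalGraphAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.merge = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, adjacency):
        # x: (B, J, d_model) joint tokens; adjacency: (J, J) bool, True where two joints
        # are connected in the skeleton graph (self-loops included).
        local_mask = ~adjacency                               # True = attention blocked
        # Local branch: attention restricted to skeleton-graph neighbours.
        local, _ = self.local_attn(x, x, x, attn_mask=local_mask)
        # Global branch: unrestricted attention for whole-body coordination.
        globl, _ = self.global_attn(x, x, x)
        return self.norm(x + self.merge(torch.cat([local, globl], dim=-1)))
```

Under this reading, the adjacency mask keeps the local branch focused on parent-child joint relations, while the global branch handles whole-skeleton coordination, matching the "joint-level local reasoning and global coordination" described above.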