MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons

Abstract

Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, in which a Video-to-Pose network predicts joint positions and an analytical inverse-kinematics (IK) stage recovers joint rotations. While effective, this design is inherently limited in two ways: joint positions do not fully determine rotations, leaving degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective. In this work, we present the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. We observe that the ambiguity in the pose-to-rotation mapping arises from missing coordinate-system information: the same joint positions can correspond to different rotations under different rest poses and local axis conventions. To resolve this, we introduce a reference pose-rotation pair from the target asset, which, together with the rest pose, not only anchors the mapping but also defines the underlying rotation coordinate system. This formulation turns rotation prediction into a well-constrained conditional problem and enables effective learning. In addition, our model predicts joint positions directly from video without relying on mesh intermediates, improving both robustness and efficiency. Both stages share a skeleton-aware Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module for joint-level local reasoning and global coordination. Experiments on Truebones Zoo and Objaverse show that our method reduces rotation error from ~17 degrees to ~10 degrees, and to 6.54 degrees on unseen skeletons, while achieving ~20x faster inference than mesh-based pipelines. Project page: https://animotionlab.github.io/MoCapAnythingV2/
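The two ambiguities named in the abstract (bone-axis twist and dependence on the rest pose) can be illustrated with a single bone. The following is a minimal sketch and is not taken from the paper: the one-bone setup, the chosen angles, and all function and variable names are assumptions made for illustration only.

```python
import math

def axis_angle_matrix(axis, angle):
    """Rotation matrix (Rodrigues' formula) for `angle` radians about `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    x, y, z = (a / norm for a in axis)
    c, s = math.cos(angle), math.sin(angle)
    k = 1.0 - c
    return [
        [c + x * x * k, x * y * k - z * s, x * z * k + y * s],
        [y * x * k + z * s, c + y * y * k, y * z * k - x * s],
        [z * x * k - y * s, z * y * k + x * s, c + z * z * k],
    ]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotate(m, v):
    """Apply a 3x3 rotation matrix to a 3-vector."""
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))

def angle_between_deg(a, b):
    """Geodesic angle in degrees between two rotation matrices."""
    a_t = [list(row) for row in zip(*a)]
    rel = matmul(a_t, b)
    cos_theta = (rel[0][0] + rel[1][1] + rel[2][2] - 1.0) / 2.0
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

# Hypothetical rest pose: the child joint sits one unit along +Y from its parent.
rest_offset_y = (0.0, 1.0, 0.0)

# Ambiguity 1 (twist): a swing, and the same swing preceded by a 40-degree
# twist about the bone's own rest axis, place the child joint identically.
swing = axis_angle_matrix((0.0, 0.0, 1.0), math.pi / 2)
twist = axis_angle_matrix(rest_offset_y, math.radians(40.0))
swing_twist = matmul(swing, twist)

pos_swing = rotate(swing, rest_offset_y)
pos_swing_twist = rotate(swing_twist, rest_offset_y)
twist_gap_deg = angle_between_deg(swing, swing_twist)

# Ambiguity 2 (rest pose): with a different hypothetical rest offset (+X),
# reaching the same child position requires a different rotation.
rest_offset_x = (1.0, 0.0, 0.0)
half_turn = axis_angle_matrix((0.0, 0.0, 1.0), math.pi)
pos_other_rest = rotate(half_turn, rest_offset_x)
rest_gap_deg = angle_between_deg(swing, half_turn)

print("child position, swing only:       ", pos_swing)
print("child position, swing plus twist: ", pos_swing_twist)
print("rotation gap from twist (deg):    ", twist_gap_deg)
print("child position, other rest pose:  ", pos_other_rest)
print("rotation gap from rest pose (deg):", rest_gap_deg)
```

In this toy setting the three child positions coincide while the rotations differ, which is why a position-only target leaves the rotation under-determined until the rest pose and axis conventions are supplied.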
