PhyMotion: Structured 3D Motion Rewards for Physics-Grounded Human Video Generation

Abstract

Generating realistic human motion is a central yet unsolved challenge in video generation. Reinforcement learning (RL)-based post-training has driven recent gains in general video quality, but extending it to human motion is bottlenecked by reward signals that cannot reliably score motion realism. Existing video rewards rely primarily on 2D perceptual signals. They do not explicitly model the 3D body state, contact, and dynamics underlying articulated human motion, and they often assign high scores to videos with floating bodies or physically implausible movements. To address this, we propose PhyMotion, a structured, fine-grained motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates motion quality along multiple dimensions of physical feasibility. Concretely, we recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each component provides a continuous, interpretable signal tied to a specific aspect of motion quality, so the reward captures which aspects of a motion are physically correct and which are violated. Experiments show that PhyMotion correlates more strongly with human judgments than existing reward formulations. These gains carry over to RL-based post-training: optimizing PhyMotion yields larger and more consistent improvements than optimizing existing rewards, improving motion realism for both autoregressive and bidirectional video generators under both automatic metrics and blind human evaluation (+68 Elo gain). Ablations show that the three axes provide complementary supervision signals, and that the reward preserves overall video generation quality with only modest training overhead.
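
To make the three-axis structure concrete, the following is a minimal Python sketch of how per-axis scores might be computed and combined. It is an illustration under stated assumptions, not the PhyMotion implementation: the Frame layout, the function names, the thresholds (joint range, contact height, torque limit), the exponential scoring, and the equal weighting are all hypothetical. The sketch also assumes that SMPL recovery and retargeting to the simulated humanoid have already produced the per-frame quantities.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Frame:
    """One frame of a retargeted humanoid trajectory (hypothetical layout)."""
    joint_angles: List[float]   # radians
    foot_height: float          # metres, lowest foot above the ground plane
    foot_speed: float           # m/s, horizontal speed of that foot
    joint_torques: List[float]  # N*m, e.g. estimated by inverse dynamics


def _soft_score(violation: float, scale: float) -> float:
    """Map a non-negative violation to (0, 1]; 1.0 means no violation."""
    return math.exp(-violation / scale)


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def kinematic_score(frames: Sequence[Frame], lo: float = -2.5, hi: float = 2.5) -> float:
    """Penalise joint angles outside an assumed range [lo, hi]."""
    excess = [max(0.0, lo - a, a - hi) for f in frames for a in f.joint_angles]
    return _soft_score(_mean(excess), scale=0.1)


def contact_score(frames: Sequence[Frame], contact_height: float = 0.02) -> float:
    """Penalise floating (lowest foot off the ground) and foot sliding in contact."""
    floating = [max(0.0, f.foot_height - contact_height) for f in frames]
    sliding = [f.foot_speed for f in frames if f.foot_height <= contact_height]
    return _soft_score(_mean(floating), 0.05) * _soft_score(_mean(sliding), 0.1)


def dynamic_score(frames: Sequence[Frame], torque_limit: float = 150.0) -> float:
    """Penalise joint torques beyond an assumed actuator limit."""
    excess = [
        max(0.0, abs(t) - torque_limit) / torque_limit
        for f in frames
        for t in f.joint_torques
    ]
    return _soft_score(_mean(excess), scale=0.1)


def motion_reward(
    frames: Sequence[Frame],
    weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3),
) -> Dict[str, float]:
    """Return the three per-axis scores plus their weighted sum."""
    if not frames:
        raise ValueError("trajectory must contain at least one frame")
    scores = {
        "kinematic": kinematic_score(frames),
        "contact": contact_score(frames),
        "dynamic": dynamic_score(frames),
    }
    axes = ("kinematic", "contact", "dynamic")
    scores["total"] = sum(w * scores[k] for w, k in zip(weights, axes))
    return scores


# Usage: a grounded trajectory versus one whose feet hover 30 cm above the floor.
grounded = [Frame([0.1, -0.2], 0.0, 0.0, [20.0, -30.0]) for _ in range(5)]
floating = [Frame([0.1, -0.2], 0.3, 0.0, [20.0, -30.0]) for _ in range(5)]
print(motion_reward(grounded))
print(motion_reward(floating))
```

Returning the per-axis scores alongside the total mirrors the interpretability argument in the abstract: a floating body lowers only the contact score, so the reward indicates which physical property was violated and not merely that the motion is poor.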