PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

Abstract

Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has driven recent gains in general video quality, extending it to human motion remains bottlenecked by a reward signal that cannot reliably score motion realism. Existing video rewards rely primarily on 2D perceptual signals and do not explicitly model the 3D body state, contact, and dynamics underlying articulated human motion; as a result, they often assign high scores to videos with floating bodies or physically implausible movements. To address this, we propose PhyMotion, a structured, fine-grained motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates motion quality along multiple dimensions of physical feasibility. Concretely, we recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each component provides a continuous and interpretable signal tied to a specific aspect of motion quality, allowing the reward to capture which aspects of motion are physically correct or violated. Experiments show that PhyMotion achieves stronger correlation with human judgments than existing reward formulations. These gains carry over to RL-based post-training: optimizing PhyMotion yields larger and more consistent improvements than optimizing existing rewards, improving motion realism for both autoregressive and bidirectional video generators under automatic metrics and blind human evaluation (+68 Elo gain). Ablations show that the three axes provide complementary supervision signals, while the reward preserves overall video generation quality with only modest training overhead.
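
To make the three-axis structure concrete, the following is a minimal Python sketch of how a structured reward of this kind could be assembled from per-axis scores. It is not the PhyMotion implementation: the abstract does not give the formulas, so every function name, scale constant, torque limit, and weight below is a hypothetical placeholder. The sketch also assumes that joint positions, foot heights, and joint torques have already been obtained elsewhere (in the paper's pipeline these would come from SMPL recovery and the MuJoCo humanoid, which are not reproduced here).

```python
import numpy as np


def kinematic_score(joints, dt=1.0 / 30.0, accel_scale=50.0):
    """Hypothetical kinematic term: penalize large joint accelerations.

    joints: array of shape (T, J, 3), joint positions in meters, T >= 3.
    Returns a score in (0, 1]; 1 means no acceleration at all.
    """
    accel = np.diff(joints, n=2, axis=0) / dt**2  # second finite difference
    mean_accel = np.linalg.norm(accel, axis=-1).mean()
    return float(np.exp(-mean_accel / accel_scale))


def contact_score(foot_heights, height_scale=0.05):
    """Hypothetical contact term: the lowest foot should stay near the ground.

    foot_heights: array of shape (T, 2), signed height above the ground plane.
    Both floating (positive) and penetration (negative) are penalized.
    """
    lowest = foot_heights.min(axis=1)
    return float(np.exp(-np.abs(lowest).mean() / height_scale))


def dynamic_score(torques, torque_limit=200.0):
    """Hypothetical dynamics term: fraction of torque samples within a limit.

    torques: array of shape (T, D), joint torques (assumed precomputed,
    for example by inverse dynamics in a simulator).
    """
    return float((np.abs(torques) <= torque_limit).mean())


def combined_reward(joints, foot_heights, torques, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three assumed per-axis scores, plus the breakdown."""
    components = {
        "kinematic": kinematic_score(joints),
        "contact": contact_score(foot_heights),
        "dynamic": dynamic_score(torques),
    }
    total = float(np.dot(weights, list(components.values())))
    return total, components
```

Returning the per-axis breakdown alongside the total mirrors the abstract's point that each component is an interpretable signal: a clip with a floating body would, under these assumptions, score low on the contact term while the other two terms could remain high.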