PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

Abstract

Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has driven recent gains in general video quality, extending it to human motion remains bottlenecked by reward signals that cannot reliably score motion realism. Existing video rewards rely primarily on 2D perceptual signals and do not explicitly model the 3D body state, contact, and dynamics underlying articulated human motion, so they often assign high scores to videos with floating bodies or physically implausible movements. To address this, we propose PhyMotion, a structured, fine-grained motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates motion quality along multiple dimensions of physical feasibility. Concretely, we recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each component provides a continuous, interpretable signal tied to a specific aspect of motion quality, so the reward captures which aspects of a motion are physically correct and which are violated. Experiments show that PhyMotion correlates more strongly with human judgments than existing reward formulations. These gains carry over to RL-based post-training: optimizing PhyMotion yields larger and more consistent improvements than optimizing existing rewards, improving motion realism for both autoregressive and bidirectional video generators under both automatic metrics and blind human evaluation (+68 Elo gain). Ablations show that the three axes provide complementary supervision signals, and that the reward preserves overall video generation quality with only modest training overhead.
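To give the reported +68 Elo gain a rough interpretation, the sketch below converts an Elo gap into an expected pairwise preference rate. It assumes the standard logistic Elo model with a scale of 400; the abstract does not state how the Elo scores were fitted, so both the formula choice and the helper name `elo_expected_score` are illustrative assumptions rather than the authors' procedure.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated side under the standard logistic Elo model.

    Assumption: the conventional base-10 logistic curve with scale 400.
    The abstract does not specify the Elo variant used in the human study.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


if __name__ == "__main__":
    gap = 68.0  # Elo gain reported in the abstract
    p = elo_expected_score(gap)
    print(f"Expected preference rate at +{gap:.0f} Elo: {p:.3f}")
```

Under these assumptions, a 68-point gap corresponds to the higher-rated model being preferred in about 60% of blind pairwise comparisons, with ties counted as half a win.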