PhysisForcing: A Physics-Enhanced World Simulator for Robotic Manipulation

Abstract

Video generation models have emerged as a promising paradigm for embodied world simulation. However, both general-domain video generators and models fine-tuned on robot-specific data can still produce physically implausible manipulations, including discontinuous motion trajectories and inconsistent robot-object interactions, which limits their reliability as world simulators. Through extensive experiments, we find that this physical instability arises mainly from two factors: deformation of moving objects and implausible spatio-temporal correlations among interacting entities, particularly during contact. Building on this observation, we propose PhysisForcing, a scalable training framework that strengthens physical consistency by focusing supervision on physics-informative regions through joint optimization of pixel-level and semantic-level features. The framework consists of two losses: a pixel-level trajectory alignment loss, which supervises DiT features using reference point trajectories, and a semantic-level relational alignment loss, which aligns DiT features with inter-region relations extracted from a frozen video understanding encoder. Extensive experiments on R-Bench, PAI-Bench, and EZS-Bench show that PhysisForcing consistently improves embodied video generation over strong baselines. On R-Bench, it improves the Wan2.2-I2V-A14B and Cosmos3-Nano base models by 22.3% and 9.2%, respectively (7.1% and 3.7% over vanilla fine-tuning), with the Cosmos3-Nano variant attaining the best overall score. Beyond generation, when used as a world model under the WorldArena action-planner protocol, it raises the closed-loop success rate from 16.0% to 24.0% and further improves downstream policy success, indicating that physically aligned video models yield stronger representations for robotic manipulation.
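
To make the two alignment terms concrete, the following is a minimal sketch in plain Python. The abstract does not give the exact formulations, so everything here is an assumption for illustration: the function names, the use of cosine similarity as the "relation" between regions, the mean-squared comparison of relation matrices, and the consecutive-frame consistency form of the trajectory term are all hypothetical and may differ from the actual PhysisForcing losses.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def relation_matrix(region_feats):
    """Pairwise cosine similarities between per-region feature vectors."""
    return [[cosine(u, v) for v in region_feats] for u in region_feats]

def relational_alignment_loss(dit_regions, encoder_regions):
    """ASSUMED form of a semantic-level relational alignment term.

    Compares how regions relate to each other in the generator (DiT)
    features against how the same regions relate in the frozen encoder
    features, as the mean squared difference of the relation matrices.
    """
    r_dit = relation_matrix(dit_regions)
    r_enc = relation_matrix(encoder_regions)
    n = len(r_dit)
    total = sum((r_dit[i][j] - r_enc[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)

def trajectory_alignment_loss(feature_maps, trajectory):
    """ASSUMED form of a pixel-level trajectory alignment term.

    feature_maps[t][y][x] is the feature vector at frame t, row y, column x.
    trajectory[t] = (x, y) is the reference location of one tracked point.
    Features sampled along the track are encouraged to stay consistent:
    the loss is the mean of (1 - cosine) over consecutive frames.
    """
    feats = [feature_maps[t][y][x] for t, (x, y) in enumerate(trajectory)]
    pairs = list(zip(feats, feats[1:]))
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)
```

Comparing relation matrices rather than raw features means the two feature spaces do not need the same dimensionality, which is one plausible reason to align relations instead of the features themselves. In a real training setup these terms would be computed on tensors with an autodiff framework and added to the generator's base objective with weighting coefficients that the abstract does not specify.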