PhysiFormer: Learning to Simulate Mechanics in World Space

Abstract

We present PhysiFormer, a diffusion transformer that generates physically plausible 3D object motion. Unlike video world models, which operate in view-dependent pixel space, PhysiFormer represents objects as 3D meshes expressed in world coordinates. Given the initial vertex positions and velocities and each object's material type (rigid or elastic), the model samples future vertex trajectories. Related neural physics approaches build on ad-hoc latent spaces or explicitly enforce rigidity and causality. PhysiFormer shows that excellent results can be obtained without any such inductive biases by casting vertex trajectory prediction as a single denoising diffusion process directly in world coordinates. This probabilistic formulation captures uncertainty in the learned dynamics and can generate diverse plausible futures from the same initial conditions, which makes the framework potentially useful for applications with unobserved uncertainty. For efficiency, the model factorises attention over time, space, and objects, which enables permutation-invariant multi-object reasoning without explicit object encoding. Trained on over 100k simulated trajectories, PhysiFormer generates rigid and elastic mechanics and generalises to mixed-material settings, unseen real-world geometries, and larger object counts. It substantially outperforms autoregressive baselines in trajectory accuracy, rigidity preservation, and momentum-based physical consistency. Our results position coordinate-space diffusion as a promising step toward view-invariant, geometry-aware world modelling for robotics, graphics, and physical design. Visualisations, code, and models are available at https://yimingc9.github.io/physiformer.
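To make the idea of attention factorised over time, space, and objects concrete, the following is a minimal NumPy sketch. It is not the PhysiFormer implementation: the tensor layout `(objects, time, vertices, features)`, the weight-free single-head attention, the residual ordering, and the names `self_attention`, `attend_along`, and `factorised_block` are all assumptions made for illustration. It shows only that attending along one axis at a time, with no object index or positional encoding on the object axis, yields an output that permutes together with the input object order.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Weight-free single-head self-attention (illustrative assumption).
    # x has shape (..., N, D); the N tokens on the second-to-last axis
    # attend to each other, and all leading axes are treated as batch.
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., N, N)
    return softmax(scores, axis=-1) @ x               # (..., N, D)

def attend_along(x, axis):
    # Run self-attention along one axis of an (O, T, V, D) tensor by
    # moving that axis into the token position and back again.
    moved = np.moveaxis(x, axis, -2)
    return np.moveaxis(self_attention(moved), -2, axis)

def factorised_block(x):
    # x: (objects, time, vertices, features). Hypothetical block layout:
    # three residual attention passes, one per axis.
    x = x + attend_along(x, 1)  # over time steps
    x = x + attend_along(x, 2)  # over vertices within each object
    x = x + attend_along(x, 0)  # over objects; no object encoding is used
    return x

# Usage: 3 objects, 4 time steps, 5 vertices, 8 features per vertex.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 5, 8))
y = factorised_block(x)

# Reordering the objects reorders the output in the same way.
perm = [2, 0, 1]
y_perm = factorised_block(x[perm])
```

With O objects, T time steps, and V vertices, full attention over all O·T·V tokens scales with (O·T·V)², whereas the three factorised passes scale with O·T·V·(O + T + V), which is the efficiency argument in this sketch's setting.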