EgoPhys: Learning Generalizable Physical Models of Deformable Objects from Egocentric Video

Abstract

Humans naturally understand object physics through everyday interactions, but faithfully predicting complex deformable dynamics, such as those of elastic materials and fabrics, remains a major challenge for computer vision and robotics. We present EgoPhys, a framework that constructs deformable physical digital twins from egocentric RGB-only video using generalizable priors. EgoPhys overcomes limitations of existing methods by distilling per-object inverse-physics solutions into a compact codebook. This enables controllable deformable digital twin generation from egocentric videos, and it allows dense spring stiffness fields to be predicted for unseen objects without per-spring test-time optimization. With generalizable priors learned from diverse egocentric interactions, EgoPhys outperforms baselines in reconstruction, future prediction, and zero-shot generalization. To support training and evaluation, we curate an egocentric interaction dataset covering diverse deformable objects, scenes, and manipulation styles. We also deploy EgoPhys on a real xArm6 robot and demonstrate that a digital twin initialized from a single egocentric human play video can serve as an internal world representation that aids deformable-object planning. This highlights egocentric RGB observations as a scalable path toward real-to-sim pipelines.