EgoPhys: Learning Generalizable Physical Models of Deformable Objects from Egocentric Video

Abstract

Humans naturally understand object physics through everyday interaction, but faithfully predicting complex deformable dynamics, such as those of elastic materials and fabrics, remains a major challenge for computer vision and robotics. We present EgoPhys, a framework that uses generalizable priors to construct deformable physical digital twins from RGB-only egocentric video. EgoPhys distills per-object inverse-physics solutions into a compact codebook, which lets it predict dense spring stiffness fields for unseen objects without per-spring test-time optimization and thereby generate controllable deformable digital twins from egocentric video, overcoming a limitation of existing methods. Trained with generalizable priors learned from diverse egocentric interactions, EgoPhys outperforms baselines in reconstruction, future prediction, and zero-shot generalization. To support training and evaluation, we curate an egocentric interaction dataset covering diverse deformable objects, scenes, and manipulation styles. We deploy EgoPhys on a real xArm6 robot and demonstrate that a digital twin initialized from a single egocentric human play video can serve as an internal world representation that aids deformable-object planning, highlighting RGB-only egocentric observation as a scalable path toward real-to-sim pipelines.
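
To make the idea of a codebook-driven spring stiffness field concrete, the following is a minimal sketch and not the EgoPhys implementation. It assumes, purely for illustration, that each spring carries a small feature vector, that the codebook maps a feature to a stiffness by nearest-prototype lookup, and that the twin is a mass-spring system governed by Hooke's law. The names `CODEBOOK`, `lookup_stiffness`, and `spring_forces`, the feature layout, and all numeric values are hypothetical; the abstract does not specify any of them.

```python
import math

# Hypothetical codebook: each entry pairs a feature prototype with a stiffness.
# All values are illustrative assumptions, not taken from the paper.
CODEBOOK = [
    {"prototype": (0.0, 0.0), "stiffness": 50.0},    # soft, cloth-like
    {"prototype": (1.0, 0.0), "stiffness": 500.0},   # medium, rubber-like
    {"prototype": (0.0, 1.0), "stiffness": 5000.0},  # stiff
]


def lookup_stiffness(feature):
    """Return the stiffness of the codebook entry nearest to `feature`."""
    nearest = min(CODEBOOK, key=lambda e: math.dist(e["prototype"], feature))
    return nearest["stiffness"]


def spring_forces(positions, springs, rest_lengths, stiffness):
    """Compute Hooke's-law forces on every particle of a mass-spring system."""
    forces = [[0.0, 0.0, 0.0] for _ in positions]
    for (i, j), rest, k in zip(springs, rest_lengths, stiffness):
        delta = [b - a for a, b in zip(positions[i], positions[j])]
        length = math.sqrt(sum(d * d for d in delta))
        if length == 0.0:
            continue  # coincident endpoints: direction undefined, skip
        magnitude = k * (length - rest)  # positive when the spring is stretched
        for axis in range(3):
            f = magnitude * delta[axis] / length
            forces[i][axis] += f  # a stretched spring pulls i toward j
            forces[j][axis] -= f  # and pulls j toward i
    return forces


# Three particles joined by two springs; the first spring is stretched by 0.5.
positions = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.0, 0.0)]
springs = [(0, 1), (1, 2)]
rest_lengths = [1.0, 1.0]
features = [(0.1, 0.0), (0.9, 0.1)]  # hypothetical per-spring features

# Stiffness comes from a lookup, not from per-spring optimization.
stiffness = [lookup_stiffness(f) for f in features]
forces = spring_forces(positions, springs, rest_lengths, stiffness)
```

The point of the sketch is the division of labor: the stiffness of each spring is read from a small shared table, so a new object only needs per-spring features rather than a fresh optimization over every spring.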