EgoPhys: Learning Generalizable Physical Models of Deformable Objects from Egocentric Video

Abstract

Humans naturally understand object physics through everyday interactions, but faithfully predicting complex deformable dynamics, such as those of elastic materials and fabrics, remains a major challenge for computer vision and robotics. We present EgoPhys, a framework that constructs deformable physical digital twins from egocentric RGB-only video using generalizable priors. To overcome the limitations of existing methods, EgoPhys distills per-object inverse-physics solutions into a compact codebook. This lets it predict dense spring stiffness fields for unseen objects without per-spring test-time optimization, and thereby generate controllable deformable digital twins from egocentric videos. Trained with generalizable priors from diverse egocentric interactions, EgoPhys outperforms baselines in reconstruction, future prediction, and zero-shot generalization. To support training and evaluation, we curate an egocentric interaction dataset covering diverse deformable objects, scenes, and manipulation styles. We also deploy EgoPhys on a real xArm6 robot and show that a digital twin initialized from a single egocentric human play video can serve as an internal world representation that aids deformable-object planning. This result highlights egocentric RGB observations as a scalable path toward real-to-sim pipelines.
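
To make the idea of a codebook-indexed spring stiffness field concrete, the following is a minimal sketch and not the EgoPhys implementation. It assumes a 2D spring-mass system, a hypothetical scalar codebook (`CODEBOOK`), and hand-set code indices per spring; the abstract does not specify the form of the codebook, how indices are predicted, or the simulator, so all names and values here are illustrative assumptions.

```python
import math

# Hypothetical codebook: a handful of stiffness values (N/m) shared across objects.
# In a learned system these entries would be distilled from per-object solutions.
CODEBOOK = [50.0, 200.0, 800.0]


def stiffness_field(code_indices, codebook=CODEBOOK):
    """Look up one stiffness value per spring from its code index."""
    return [codebook[i] for i in code_indices]


def spring_forces(positions, springs, rest_lengths, stiffness):
    """Hooke's-law forces on each particle of a 2D spring-mass system."""
    forces = [[0.0, 0.0] for _ in positions]
    for (a, b), rest, k in zip(springs, rest_lengths, stiffness):
        dx = positions[b][0] - positions[a][0]
        dy = positions[b][1] - positions[a][1]
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # direction undefined for coincident particles
        magnitude = k * (length - rest)  # positive when the spring is stretched
        fx = magnitude * dx / length
        fy = magnitude * dy / length
        forces[a][0] += fx
        forces[a][1] += fy
        forces[b][0] -= fx
        forces[b][1] -= fy
    return forces


def step(positions, velocities, springs, rest_lengths, stiffness,
         mass=1.0, dt=0.01, damping=0.98):
    """One damped semi-implicit Euler step: update velocity, then position."""
    forces = spring_forces(positions, springs, rest_lengths, stiffness)
    new_velocities = [
        [damping * (v[0] + dt * f[0] / mass), damping * (v[1] + dt * f[1] / mass)]
        for v, f in zip(velocities, forces)
    ]
    new_positions = [
        [p[0] + dt * v[0], p[1] + dt * v[1]]
        for p, v in zip(positions, new_velocities)
    ]
    return new_positions, new_velocities


# Usage: one stretched spring whose stiffness comes from codebook entry 1.
positions = [[0.0, 0.0], [2.0, 0.0]]
velocities = [[0.0, 0.0], [0.0, 0.0]]
springs = [(0, 1)]
rest_lengths = [1.0]
k = stiffness_field([1])
positions, velocities = step(positions, velocities, springs, rest_lengths, k)
```

Under these assumptions, predicting a stiffness field for a new object reduces to choosing one code index per spring, which is far cheaper than optimizing every spring's stiffness independently at test time.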