SurgWorld: Learning Surgical Robot Policies from Videos via World Modeling
December 29, 2025
Authors: Yufan He, Pengfei Guo, Mengya Xu, Zhaoshuo Li, Andriy Myronenko, Dillan Imans, Bingjie Liu, Dongren Yang, Mingxue Gu, Yongnan Ji, Yueming Jin, Ren Zhao, Baiyong Shen, Daguang Xu
cs.AI
Abstract
Data scarcity remains a fundamental barrier to achieving fully autonomous surgical robots. While large-scale vision-language-action (VLA) models have shown impressive generalization in household and industrial manipulation by leveraging paired video-action data from diverse domains, surgical robotics suffers from a paucity of datasets that include both visual observations and accurate robot kinematics. In contrast, vast corpora of surgical videos exist, but they lack corresponding action labels, preventing the direct application of imitation learning or VLA training. In this work, we alleviate this problem by learning policy models from SurgWorld, a world model designed for surgical physical AI. We curated the Surgical Action Text Alignment (SATA) dataset, which provides detailed action descriptions specifically for surgical robots. We then built SurgWorld on a state-of-the-art physical AI world model and SATA; it generates diverse, generalizable, and realistic surgical videos. We are also the first to use an inverse dynamics model to infer pseudo-kinematics from synthetic surgical videos, producing paired synthetic video-action data. We demonstrate that a surgical VLA policy trained with these augmented data significantly outperforms models trained only on real demonstrations on a real surgical robot platform. Our approach offers a scalable path toward autonomous surgical skill acquisition by leveraging the abundance of unlabeled surgical video and generative world modeling, thus opening the door to generalizable and data-efficient surgical robot policies.
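To make the augmentation step concrete, the sketch below shows the pipeline the abstract describes: label each consecutive frame pair of a world-model-generated video with an inverse dynamics model (IDM), then mix the resulting pseudo-labeled episodes with real demonstrations for policy training. This is a minimal illustrative sketch, not the authors' released code; the `Episode` container, `pseudo_label`, `build_training_set`, and the stand-in IDM are all hypothetical names introduced here for clarity.

```python
# Hypothetical sketch of video-to-action augmentation: a_t ≈ IDM(o_t, o_{t+1}).
# All names and shapes are illustrative assumptions, not the paper's API.

from dataclasses import dataclass
from typing import Callable, List
import numpy as np

@dataclass
class Episode:
    frames: np.ndarray    # (T, H, W, 3) uint8 video observations
    actions: np.ndarray   # (T-1, A) per-step robot kinematics (real or pseudo)

def pseudo_label(frames: np.ndarray,
                 idm: Callable[[np.ndarray, np.ndarray], np.ndarray]) -> np.ndarray:
    """Infer one action per consecutive frame pair using the IDM."""
    return np.stack([idm(frames[t], frames[t + 1])
                     for t in range(len(frames) - 1)])

def build_training_set(real: List[Episode],
                       synthetic_frames: List[np.ndarray],
                       idm: Callable[[np.ndarray, np.ndarray], np.ndarray]) -> List[Episode]:
    """Combine real demonstrations with IDM-labeled world-model rollouts."""
    synthetic = [Episode(f, pseudo_label(f, idm)) for f in synthetic_frames]
    return real + synthetic

if __name__ == "__main__":
    # Toy usage with stand-in components: a random "IDM" and random "videos".
    rng = np.random.default_rng(0)
    fake_idm = lambda o_t, o_t1: rng.normal(size=7)  # 7-DoF action stub
    real_demo = Episode(rng.integers(0, 255, (8, 64, 64, 3), dtype=np.uint8),
                        rng.normal(size=(7, 7)))
    fake_videos = [rng.integers(0, 255, (8, 64, 64, 3), dtype=np.uint8)
                   for _ in range(2)]
    dataset = build_training_set([real_demo], fake_videos, fake_idm)
    print(len(dataset), dataset[-1].actions.shape)   # 3 (7, 7)
```

In this framing, the world model only needs to produce plausible videos; the IDM supplies the missing kinematics, so unlabeled surgical footage becomes usable for imitation-style VLA training.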