SurgWorld: Learning Surgical Robot Policies from Videos via World Modeling
December 29, 2025
作者: Yufan He, Pengfei Guo, Mengya Xu, Zhaoshuo Li, Andriy Myronenko, Dillan Imans, Bingjie Liu, Dongren Yang, Mingxue Gu, Yongnan Ji, Yueming Jin, Ren Zhao, Baiyong Shen, Daguang Xu
cs.AI
Abstract
Data scarcity remains a fundamental barrier to achieving fully autonomous surgical robots. While large-scale vision-language-action (VLA) models have shown impressive generalization in household and industrial manipulation by leveraging paired video-action data from diverse domains, surgical robotics suffers from a paucity of datasets that include both visual observations and accurate robot kinematics. In contrast, vast corpora of surgical videos exist, but they lack corresponding action labels, preventing direct application of imitation learning or VLA training. In this work, we alleviate this problem by learning policy models from SurgWorld, a world model designed for surgical physical AI. We curate the Surgical Action Text Alignment (SATA) dataset, which provides detailed action descriptions tailored to surgical robots, and build SurgWorld on a state-of-the-art physical-AI world model trained with SATA. SurgWorld generates diverse, generalizable, and realistic surgical videos. We are also the first to use an inverse dynamics model to infer pseudo-kinematics from synthetic surgical videos, producing synthetic paired video-action data. We demonstrate that a surgical VLA policy trained with these augmented data significantly outperforms models trained only on real demonstrations on a real surgical robot platform. Our approach offers a scalable path toward autonomous surgical skill acquisition by leveraging the abundance of unlabeled surgical video and generative world modeling, opening the door to generalizable and data-efficient surgical robot policies.
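To make the video-to-action labeling step concrete, the sketch below illustrates the general idea of inverse-dynamics pseudo-labeling in PyTorch: a model trained to predict the action between consecutive frames is applied to frames rolled out by the world model, yielding pseudo-paired video-action data. This is not the authors' implementation; the network architecture, the 7-dimensional action space, and all names are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): label synthetic surgical
# video frames with pseudo-kinematics using an inverse dynamics model (IDM).
import torch
import torch.nn as nn


class InverseDynamicsModel(nn.Module):
    """Predicts the action taken between two consecutive frames."""

    def __init__(self, action_dim: int = 7):  # action_dim is an assumption
        super().__init__()
        # Shared frame encoder: RGB frame -> 64-dim feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Regression head over concatenated features of frame t and frame t+1.
        self.head = nn.Sequential(
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, frame_t: torch.Tensor, frame_tp1: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.encoder(frame_t), self.encoder(frame_tp1)], dim=-1)
        return self.head(z)


@torch.no_grad()
def label_synthetic_video(idm: nn.Module, frames: torch.Tensor) -> torch.Tensor:
    """Turn a generated video (T, 3, H, W) into pseudo-actions (T-1, action_dim)."""
    idm.eval()
    return idm(frames[:-1], frames[1:])


if __name__ == "__main__":
    idm = InverseDynamicsModel(action_dim=7)
    # Stand-in for a world-model rollout: 16 synthetic 128x128 RGB frames.
    synthetic_video = torch.rand(16, 3, 128, 128)
    pseudo_actions = label_synthetic_video(idm, synthetic_video)
    # The resulting (frame, pseudo-action) pairs can augment real demonstrations
    # when training the VLA policy.
    print(pseudo_actions.shape)  # torch.Size([15, 7])
```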