RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

June 22, 2025
Authors: Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Qiwei Liang, Zixuan Li, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan-ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, Yao Mu
cs.AI

Abstract

Simulation-based data synthesis has emerged as a powerful paradigm for enhancing real-world robotic manipulation. However, existing synthetic datasets remain insufficient for robust bimanual manipulation due to two challenges: (1) the lack of an efficient, scalable data generation method for novel tasks, and (2) oversimplified simulation environments that fail to capture real-world complexity. We present RoboTwin 2.0, a scalable simulation framework that enables automated, large-scale generation of diverse and realistic data, along with unified evaluation protocols for dual-arm manipulation. We first construct RoboTwin-OD, a large-scale object library comprising 731 instances across 147 categories, each annotated with semantic and manipulation-relevant labels. Building on this foundation, we develop an expert data synthesis pipeline that combines multimodal large language models (MLLMs) with simulation-in-the-loop refinement to generate task-level execution code automatically. To improve sim-to-real transfer, RoboTwin 2.0 incorporates structured domain randomization along five axes: clutter, lighting, background, tabletop height, and language instructions, thereby enhancing data diversity and policy robustness. We instantiate this framework across 50 dual-arm tasks spanning five robot embodiments, and pre-collect over 100,000 domain-randomized expert trajectories. Empirical results show a 10.9% gain in code generation success and improved generalization to novel real-world scenarios. A VLA model fine-tuned on our dataset achieves a 367% relative improvement (42.0% vs. 9.0%) on unseen scene real-world tasks, while zero-shot models trained solely on our synthetic data achieve a 228% relative gain, highlighting strong generalization without real-world supervision. We release the data generator, benchmark, dataset, and code to support scalable research in robust bimanual manipulation.
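
The headline numbers unpack straightforwardly: the 367% figure is the relative gain (42.0 − 9.0) / 9.0 ≈ 3.67 over the 9.0% baseline, i.e. roughly 4.7× the baseline success rate on unseen-scene real-world tasks. The five-axis domain randomization described above amounts to sampling each axis independently for every generated scene. The sketch below illustrates that idea in Python; the class, object pools, instruction templates, and value ranges are hypothetical stand-ins for exposition, not the actual RoboTwin 2.0 API.

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of five-axis domain randomization (clutter, lighting,
# background, tabletop height, language instructions). All names, pools, and
# ranges below are assumptions for illustration, not RoboTwin 2.0's real API.

@dataclass
class SceneConfig:
    clutter_objects: list     # distractor objects scattered on the tabletop
    light_intensity: float    # relative brightness scale
    light_color: tuple        # RGB tint applied to the scene lights
    background_texture: str   # backdrop / table-surface asset id (assumed)
    table_height_m: float     # tabletop height in meters
    instruction: str          # natural-language task instruction

CLUTTER_POOL = ["mug", "stapler", "tape", "bowl", "marker"]   # assumed assets
BACKGROUNDS = ["wood", "marble", "cloth", "metal"]            # assumed assets
INSTRUCTION_TEMPLATES = [                                     # assumed phrasing
    "pick up the {obj} and hand it to the other arm",
    "use both arms to place the {obj} in the container",
]

def sample_scene(target_obj: str, rng: random.Random) -> SceneConfig:
    """Draw one randomized scene; each of the five axes is sampled independently."""
    return SceneConfig(
        clutter_objects=rng.sample(CLUTTER_POOL, k=rng.randint(0, 4)),
        light_intensity=rng.uniform(0.3, 1.5),
        light_color=tuple(round(rng.uniform(0.8, 1.0), 3) for _ in range(3)),
        background_texture=rng.choice(BACKGROUNDS),
        table_height_m=round(rng.uniform(0.70, 0.85), 3),
        instruction=rng.choice(INSTRUCTION_TEMPLATES).format(obj=target_obj),
    )

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_scene("bottle", rng))
```

Resampling every axis independently for each of the 100,000+ expert trajectories is what gives the dataset the visual and linguistic spread that the abstract credits for generalization without real-world supervision.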