RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Abstract

Simulation-based data synthesis has emerged as a powerful paradigm for enhancing real-world robotic manipulation. However, existing synthetic datasets remain insufficient for robust bimanual manipulation because of two challenges: (1) the lack of an efficient, scalable data generation method for novel tasks, and (2) oversimplified simulation environments that fail to capture real-world complexity. We present RoboTwin 2.0, a scalable simulation framework that enables automated, large-scale generation of diverse and realistic data, along with unified evaluation protocols for dual-arm manipulation. We first construct RoboTwin-OD, a large-scale object library comprising 731 instances across 147 categories, each annotated with semantic and manipulation-relevant labels. Building on this foundation, we develop an expert data synthesis pipeline that combines multimodal large language models (MLLMs) with simulation-in-the-loop refinement to generate task-level execution code automatically. To improve sim-to-real transfer, RoboTwin 2.0 incorporates structured domain randomization along five axes (clutter, lighting, background, tabletop height, and language instructions), thereby enhancing data diversity and policy robustness. We instantiate this framework across 50 dual-arm tasks spanning five robot embodiments and pre-collect over 100,000 domain-randomized expert trajectories. Empirical results show a 10.9% gain in code generation success and improved generalization to novel real-world scenarios. A VLA model fine-tuned on our dataset achieves a 367% relative improvement (42.0% vs. 9.0%) on real-world tasks in unseen scenes, while zero-shot models trained solely on our synthetic data achieve a 228% relative gain, highlighting strong generalization without real-world supervision. We release the data generator, benchmark, dataset, and code to support scalable research in robust bimanual manipulation.
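
The abstract does not spell out how the relative improvement is computed. Assuming the usual definition, (new - baseline) / baseline, the 367% figure follows from the two quoted success rates. The sketch below only checks that arithmetic; the function name `relative_improvement` is illustrative and does not come from the RoboTwin 2.0 code.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Return the relative change of `new` over `baseline`, in percent.

    Assumed definition: (new - baseline) / baseline * 100.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# Success rates quoted in the abstract: 42.0% (fine-tuned) vs. 9.0%.
gain = relative_improvement(42.0, 9.0)
print(f"{gain:.0f}%")  # prints "367%"
```

The same function would apply to the 228% zero-shot figure, but the abstract does not quote the underlying success rates for that case, so it is not reproduced here.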
