X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real

Abstract

Human videos offer a scalable way to train robot manipulation policies, but they lack the action labels that standard imitation learning algorithms require. Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly. We propose X-Sim, a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies. X-Sim first reconstructs a photorealistic simulation from an RGBD human video and tracks object trajectories to define object-centric rewards. These rewards are used to train a reinforcement learning (RL) policy in simulation. The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting. To transfer to the real world, X-Sim introduces an online domain adaptation technique that aligns real and simulated observations during deployment. Importantly, X-Sim does not require any robot teleoperation data. We evaluate X-Sim on 5 manipulation tasks in 2 environments and show that it (1) improves task progress by 30% on average over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with 10x less data collection time, and (3) generalizes to new camera viewpoints and test-time changes. Code and videos are available at https://portal-cornell.github.io/X-Sim/.
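The abstract does not give the functional form of the object-centric reward. As a minimal sketch only, assuming the reward depends on the distance between the simulated object position and the position tracked from the human video at the same timestep, it could look like the following. The names `object_reward`, `trajectory_reward`, and the `scale` parameter are hypothetical and are not taken from the X-Sim code.

```python
import math
from typing import Sequence

Position = Sequence[float]  # (x, y, z) object position, assumed to be in metres


def object_reward(sim_pos: Position, target_pos: Position, scale: float = 10.0) -> float:
    """Dense reward in (0, 1]: equals 1 when the simulated object sits on the target.

    The exponential shaping and the `scale` value are assumptions for illustration.
    """
    distance = math.dist(sim_pos, target_pos)
    return math.exp(-scale * distance)


def trajectory_reward(
    sim_pos: Position,
    tracked_trajectory: Sequence[Position],
    step: int,
    scale: float = 10.0,
) -> float:
    """Reward the simulated object for following the trajectory tracked from the video."""
    # Hold the final waypoint once the demonstration has ended.
    index = min(step, len(tracked_trajectory) - 1)
    return object_reward(sim_pos, tracked_trajectory[index], scale)


# Hypothetical object trajectory extracted from a human video.
tracked = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.05)]

on_track = trajectory_reward((0.1, 0.0, 0.0), tracked, step=1)
off_track = trajectory_reward((0.0, 0.0, 0.0), tracked, step=1)
```

Because such a reward is defined on the object rather than on the hand or gripper, it does not depend on which embodiment moves the object, which is the property the abstract relies on for transfer from human video to robot.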
