Object-Centric Residual Reinforcement Learning for Zero-Shot Sim-to-Real VLA Enhancement

Abstract

Vision-Language-Action (VLA) models can generalize across diverse manipulation tasks, but their imitation-learning-based policies remain brittle in precise physical interactions due to compounding execution errors. Can a reinforcement learning policy trained purely in simulation improve the robustness of real-world VLAs zero-shot? Residual RL, which learns a corrective policy on top of a frozen VLA, offers a natural framework, but existing approaches face a fundamental sim-to-real dilemma: privileged-state methods require lossy distillation for deployment, image-based methods suffer from the visual domain gap, and real-world RL is costly and unsafe. We propose an object-centric residual RL framework that refines VLA actions using object poses, yielding a compact observation space that transfers consistently between simulation and reality. To align the two domains, we additionally replay the same teleoperation demonstrations in simulation to train a sim counterpart of the real-world VLA. The residual RL policy is trained only in simulation, with pose noise injection and dropout, and transfers zero-shot to the real robot. Across five manipulation tasks on a real Franka Research 3 (FR3) robot, our method improves the success rate from 42% to 76% zero-shot. The improved rollouts can further be reused to retrain the base VLA for self-improvement without additional teleoperation. Project page: https://www.microsoft.com/en-us/research/articles/object-centric-residual-rl/
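To make the two mechanisms named in the abstract concrete, the following is a minimal sketch of (a) perturbing an object-pose observation with noise and dropout during simulated training and (b) adding a bounded residual correction to a frozen base action. The abstract does not specify any of these details, so everything here is an assumption for illustration: the 6-D pose layout, Gaussian noise, hold-last-value dropout, the additive clipped residual, and all function names and parameter values are hypothetical.

```python
import numpy as np


def perturb_pose(pose, rng, pos_std=0.005, rot_std=0.02,
                 dropout_prob=0.1, last_pose=None):
    """Return (observed_pose, valid_flag) for one simulated time step.

    pose is assumed to be [x, y, z, roll, pitch, yaw]. With probability
    dropout_prob the measurement is dropped: the last observation (or zeros
    if there is none) is returned together with valid_flag = 0.0.
    """
    if rng.random() < dropout_prob:
        fallback = np.zeros_like(pose) if last_pose is None else last_pose
        return fallback.copy(), 0.0
    # Independent Gaussian noise on position and orientation (assumed model).
    noise = np.concatenate([rng.normal(0.0, pos_std, 3),
                            rng.normal(0.0, rot_std, 3)])
    return pose + noise, 1.0


def residual_action(base_action, residual, scale=0.1, low=-1.0, high=1.0):
    """Add a bounded correction to the action of the frozen base policy."""
    # The raw residual is clipped to [-1, 1] and scaled, so the correction
    # can never exceed `scale` per action dimension.
    correction = scale * np.clip(residual, -1.0, 1.0)
    return np.clip(base_action + correction, low, high)


# Hypothetical usage with placeholder values.
rng = np.random.default_rng(0)
true_pose = np.array([0.4, 0.0, 0.1, 0.0, 0.0, 0.0])
observed_pose, valid = perturb_pose(true_pose, rng)
base = np.zeros(7)            # stand-in for the frozen VLA's action
raw_residual = np.ones(7)     # stand-in for the residual policy's output
action = residual_action(base, raw_residual)
```

Bounding the correction keeps the residual policy close to the base behavior, which is one common motivation for residual formulations; whether the authors bound or scale the residual in this way is not stated in the abstract.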