

Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation

February 18, 2026
Authors: Runpei Dong, Ziyan Li, Xialin He, Saurabh Gupta
cs.AI

Abstract

Visual loco-manipulation of arbitrary objects in the wild with humanoid robots requires accurate end-effector (EE) control and a generalizable understanding of the scene via visual inputs (e.g., RGB-D images). Existing approaches are based on real-world imitation learning and exhibit limited generalization due to the difficulty in collecting large-scale training datasets. This paper presents a new paradigm, HERO, for object loco-manipulation with humanoid robots that combines the strong generalization and open-vocabulary understanding of large vision models with strong control performance from simulated training. We achieve this by designing an accurate residual-aware EE tracking policy. This EE tracking policy combines classical robotics with machine learning. It uses a) inverse kinematics to convert residual end-effector targets into reference trajectories, b) a learned neural forward model for accurate forward kinematics, c) goal adjustment, and d) replanning. Together, these innovations help us cut down the end-effector tracking error by 3.2x. We use this accurate end-effector tracker to build a modular system for loco-manipulation, where we use open-vocabulary large vision models for strong visual generalization. Our system is able to operate in diverse real-world environments, from offices to coffee shops, where the robot is able to reliably manipulate various everyday objects (e.g., mugs, apples, toys) on surfaces ranging from 43cm to 92cm in height. Systematic modular and end-to-end tests in simulation and the real world demonstrate the effectiveness of our proposed design. We believe the advances in this paper can open up new ways of training humanoid robots to interact with daily objects.
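The residual-aware tracking pipeline described above (IK-generated reference trajectories, a forward-kinematics model, goal adjustment, and replanning) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: a toy 2-link planar arm stands in for the humanoid, an analytic forward-kinematics function stands in for the paper's learned neural FK model, and the function names (`ik_step`, `track`, `replan_every`) are hypothetical, not from the paper.

```python
import numpy as np

def fk_planar(q, lengths=(0.3, 0.25)):
    """Toy 2-link planar FK; stands in for the learned neural FK model."""
    l1, l2 = lengths
    return np.array([
        l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
        l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1]),
    ])

def jacobian(fk, q, eps=1e-6):
    """Finite-difference Jacobian of the EE position w.r.t. joint angles."""
    base = fk(q)
    J = np.zeros((base.size, q.size))
    for i in range(q.size):
        dq = q.copy()
        dq[i] += eps
        J[:, i] = (fk(dq) - base) / eps
    return J

def ik_step(fk, q, target, damping=1e-3, gain=0.5):
    """One damped-least-squares IK step toward the commanded EE target."""
    err = target - fk(q)
    J = jacobian(fk, q)
    dq = J.T @ np.linalg.solve(J @ J.T + damping * np.eye(err.size), err)
    return q + gain * dq, np.linalg.norm(err)

def track(fk, q0, goal, max_steps=200, replan_every=50, tol=1e-3):
    """Residual-aware tracking loop: follow an IK reference and
    periodically replan, adjusting the commanded target to compensate
    for the residual between the achieved and desired EE pose."""
    q, target = q0.copy(), goal.copy()
    for t in range(max_steps):
        q, err = ik_step(fk, q, target)
        if err < tol:
            break
        if (t + 1) % replan_every == 0:
            # Goal adjustment: shift the command by the remaining residual.
            target = goal + (goal - fk(q))
    return q, fk(q)
```

For example, `track(fk_planar, np.array([0.3, 0.5]), np.array([0.35, 0.2]))` drives the toy arm's end effector close to the requested point; in the actual system, the IK solve, FK model, and replanning loop are each substantially more involved than this sketch.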