Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills
March 16, 2025
Authors: Haoqi Yuan, Yu Bai, Yuhui Fu, Bohan Zhou, Yicheng Feng, Xinrun Xu, Yi Zhan, Börje F. Karlsson, Zongqing Lu
cs.AI
Abstract
Building autonomous robotic agents capable of achieving human-level
performance in real-world embodied tasks is an ultimate goal in humanoid robot
research. Recent work has made significant progress in high-level
cognition with Foundation Models (FMs) and low-level skill development for
humanoid robots. However, directly combining these components often results in
poor robustness and efficiency due to compounding errors in long-horizon tasks
and the varied latency of different modules. We introduce Being-0, a
hierarchical agent framework that integrates an FM with a modular skill
library. The FM handles high-level cognitive tasks such as instruction
understanding, task planning, and reasoning, while the skill library provides
stable locomotion and dexterous manipulation for low-level control. To bridge
the gap between these levels, we propose a novel Connector module, powered by a
lightweight vision-language model (VLM). The Connector enhances the FM's
embodied capabilities by translating language-based plans into actionable skill
commands and dynamically coordinating locomotion and manipulation to improve
task success. With all components, except the FM, deployable on low-cost
onboard computation devices, Being-0 achieves efficient, real-time performance
on a full-sized humanoid robot equipped with dexterous hands and active vision.
Extensive experiments in large indoor environments demonstrate Being-0's
effectiveness in solving complex, long-horizon tasks that require challenging
navigation and manipulation subtasks. For further details and videos, visit
https://beingbeyond.github.io/being-0.
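
The hierarchy described in the abstract (an FM that plans in language, a Connector that turns each plan step into a skill command, and a skill library that executes it) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `SkillCommand`, `stub_planner`, `stub_connector`, `make_skill_library`, and `run_task` are hypothetical names, the planner returns a fixed plan instead of calling an FM, and the connector uses keyword matching instead of a vision-language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SkillCommand:
    name: str      # key into the skill library (hypothetical naming)
    argument: str  # target object or location


def stub_planner(instruction: str) -> List[str]:
    """Stand-in for the FM: returns a fixed language plan (assumption)."""
    return ["go to the table", "grasp the cup"]


def stub_connector(step: str) -> SkillCommand:
    """Stand-in for the Connector: keyword matching only, no vision input."""
    verb, _, target = step.partition(" the ")
    skill_name = {"go to": "navigate", "grasp": "grasp"}[verb]
    return SkillCommand(skill_name, target)


def make_skill_library(log: List[str]) -> Dict[str, Callable[[str], bool]]:
    """Toy skill library: each skill records its call and reports success."""
    def navigate(target: str) -> bool:
        log.append(f"navigate:{target}")
        return True

    def grasp(target: str) -> bool:
        log.append(f"grasp:{target}")
        return True

    return {"navigate": navigate, "grasp": grasp}


def run_task(
    instruction: str,
    planner: Callable[[str], List[str]],
    connector: Callable[[str], SkillCommand],
    skills: Dict[str, Callable[[str], bool]],
) -> bool:
    """Plan in language, translate each step to a skill command, execute it."""
    for step in planner(instruction):
        command = connector(step)
        if not skills[command.name](command.argument):
            return False  # stop at the first failed skill
    return True


log: List[str] = []
skills = make_skill_library(log)
succeeded = run_task("fetch the cup", stub_planner, stub_connector, skills)
```

In this toy version a failed skill simply ends the task. The abstract attributes more to the real Connector, which dynamically coordinates locomotion and manipulation rather than mapping steps to skills one-to-one.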