Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills

Abstract

Building autonomous robotic agents that achieve human-level performance in real-world embodied tasks is an ultimate goal of humanoid robot research. Recent work has made significant progress in high-level cognition with Foundation Models (FMs) and in low-level skill development for humanoid robots. However, directly combining these components often yields poor robustness and efficiency, because errors compound over long-horizon tasks and the modules run at different latencies. We introduce Being-0, a hierarchical agent framework that integrates an FM with a modular skill library. The FM handles high-level cognitive tasks such as instruction understanding, task planning, and reasoning, while the skill library provides stable locomotion and dexterous manipulation for low-level control. To bridge the gap between these levels, we propose a novel Connector module, powered by a lightweight vision-language model (VLM). The Connector enhances the FM's embodied capabilities by translating language-based plans into actionable skill commands and by dynamically coordinating locomotion and manipulation to improve task success. With all components except the FM deployable on low-cost onboard computation devices, Being-0 achieves efficient, real-time performance on a full-sized humanoid robot equipped with dexterous hands and active vision. Extensive experiments in large indoor environments demonstrate Being-0's effectiveness in solving complex, long-horizon tasks that require challenging navigation and manipulation subtasks. For further details and videos, visit https://beingbeyond.github.io/being-0.
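The three-level hierarchy described in the abstract (FM, Connector, skill library) can be pictured with a minimal Python sketch. Everything in it is a hypothetical stand-in rather than the authors' implementation: `fm_plan` returns a fixed plan instead of querying a foundation model, `connector_translate` matches keywords instead of running a VLM on camera images, and the skills in `SKILL_LIBRARY` are stubs that always succeed. The names `SkillCommand`, `fm_plan`, `connector_translate`, `SKILL_LIBRARY`, and `run_task` are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SkillCommand:
    skill: str   # key into the (hypothetical) skill library
    target: str  # object or location the skill acts on


def fm_plan(instruction: str) -> List[str]:
    """Stand-in for the FM: returns a fixed language-level plan."""
    return ["go to the table", "grasp the cup"]


def connector_translate(step: str) -> SkillCommand:
    """Stand-in for the Connector: map one language step to a skill command.

    Assumption: keyword matching only. A VLM-based Connector would also
    condition on visual observations.
    """
    verb_to_skill = {"go to": "navigate", "grasp": "grasp"}
    for verb, skill in verb_to_skill.items():
        if step.startswith(verb):
            target = step[len(verb):].strip()
            if target.startswith("the "):
                target = target[len("the "):]
            return SkillCommand(skill, target)
    raise ValueError(f"no skill for step: {step!r}")


# Stub skills: each takes a target and reports success.
SKILL_LIBRARY: Dict[str, Callable[[str], bool]] = {
    "navigate": lambda target: True,
    "grasp": lambda target: True,
}


def run_task(instruction: str) -> List[SkillCommand]:
    """Plan with the FM, translate each step, and execute it as a skill."""
    executed: List[SkillCommand] = []
    for step in fm_plan(instruction):
        command = connector_translate(step)
        succeeded = SKILL_LIBRARY[command.skill](command.target)
        if not succeeded:
            break  # a fuller agent would report back and replan here
        executed.append(command)
    return executed


if __name__ == "__main__":
    for command in run_task("fetch the cup"):
        print(command)
```

The sketch only shows the direction of data flow: language plan in, skill commands out, with execution results available to the loop. It does not model latency differences between modules or the coordination of locomotion with manipulation.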
