Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills

Abstract

Building autonomous robotic agents capable of human-level performance in real-world embodied tasks is an ultimate goal of humanoid robot research. Recent work has made significant progress both in high-level cognition with Foundation Models (FMs) and in low-level skill development for humanoid robots. However, directly combining these components often results in poor robustness and efficiency, because errors compound over long-horizon tasks and the different modules run at different latencies. We introduce Being-0, a hierarchical agent framework that integrates an FM with a modular skill library. The FM handles high-level cognitive tasks such as instruction understanding, task planning, and reasoning, while the skill library provides stable locomotion and dexterous manipulation for low-level control. To bridge the gap between these levels, we propose a novel Connector module, powered by a lightweight vision-language model (VLM). The Connector enhances the FM's embodied capabilities by translating language-based plans into actionable skill commands and by dynamically coordinating locomotion and manipulation to improve task success. With all components except the FM deployable on low-cost onboard computation devices, Being-0 achieves efficient, real-time performance on a full-sized humanoid robot equipped with dexterous hands and active vision. Extensive experiments in large indoor environments demonstrate Being-0's effectiveness in solving complex, long-horizon tasks that require challenging navigation and manipulation subtasks. For further details and videos, visit https://beingbeyond.github.io/being-0.
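
The three-level structure described in the abstract (an FM that plans, a Connector that turns language subtasks into skill commands, and a skill library that executes them) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`toy_planner`, `toy_connector`, `SkillCommand`, `SKILL_LIBRARY`, `run_task`) is hypothetical, the planner is a simple string split rather than a Foundation Model, and the Connector is a keyword rule rather than a vision-language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SkillCommand:
    """Hypothetical command format: which skill to run and on what target."""
    skill: str
    target: str


def toy_planner(instruction: str) -> List[str]:
    """Stand-in for the FM: split an instruction into language subtasks."""
    return [step.strip() for step in instruction.split(", then ") if step.strip()]


def toy_connector(subtask: str) -> SkillCommand:
    """Stand-in for the Connector: map a language subtask to a skill command.

    A real Connector would also use camera input; this rule only looks at the verb.
    """
    verb, _, target = subtask.partition(" ")
    skill = "navigate" if verb in {"go", "walk"} else "manipulate"
    return SkillCommand(skill=skill, target=target or verb)


# Stand-in for the modular skill library: each skill just reports what it would do.
SKILL_LIBRARY: Dict[str, Callable[[str], str]] = {
    "navigate": lambda target: f"navigate:{target}",
    "manipulate": lambda target: f"manipulate:{target}",
}


def run_task(instruction: str) -> List[str]:
    """Run the plan -> connect -> execute loop and return an execution log."""
    log = []
    for subtask in toy_planner(instruction):
        command = toy_connector(subtask)
        log.append(SKILL_LIBRARY[command.skill](command.target))
    return log


if __name__ == "__main__":
    print(run_task("go to the kitchen, then grasp the cup"))
```

The point of the sketch is only the separation of concerns: the planner never calls a skill directly, so the slow high-level model and the fast low-level skills are coupled solely through the intermediate command produced by the Connector stand-in.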
