Guava: An Effective and Universal Harness for Embodied Manipulation

Abstract

Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities across a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles hold even for small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models, along with strong generalization to unseen objects, novel instructions, and long-horizon tasks. These results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.