Guava: An Effective and General Harness for Embodied Manipulation

Abstract

Language models trained on large-scale vision-language data have demonstrated strong potential as embodied agents. Harnessing such models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems: the model supplies high-level reasoning, while external modules handle perception, planning, and control. However, it remains unclear what makes a harness effective for embodied manipulation, and to what extent such a harness can unlock embodied capabilities across a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To test whether these design principles carry over to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B-parameter open-source model using fewer than 2K trajectories collected entirely in simulation. Experiments in both simulation and real-world environments show that the distilled model performs comparably to frontier proprietary models while generalizing strongly to unseen objects, novel instructions, and long-horizon tasks. These results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.
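To make the three ingredients concrete, the following is a minimal toy sketch of an iterative perception-reasoning-action loop over semantic actions. It is not Guava's implementation: `ToyWorld`, `Observation`, `scripted_policy`, `run_episode`, and the `pick`/`place` tools are all hypothetical names introduced for illustration, the "image" is an empty placeholder, and a hard-coded rule stands in for the reasoning model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Observation:
    """Multimodal observation: a text scene summary plus an image placeholder."""
    text: str
    image: bytes = b""  # stand-in for camera pixels (assumption, unused here)


@dataclass
class ToyWorld:
    """Hypothetical tabletop state mapping object name -> location."""
    locations: Dict[str, str]
    holding: Optional[str] = None

    def observe(self) -> Observation:
        parts = [f"{obj} is on {loc}" for obj, loc in sorted(self.locations.items())]
        parts.append(f"gripper holds {self.holding or 'nothing'}")
        return Observation(text="; ".join(parts))

    # Semantic actions: the caller names an object or target, not joint angles.
    def pick(self, obj: str) -> str:
        if self.holding is not None or obj not in self.locations:
            return f"pick({obj}) failed"
        del self.locations[obj]
        self.holding = obj
        return f"picked {obj}"

    def place(self, target: str) -> str:
        if self.holding is None:
            return f"place({target}) failed"
        self.locations[self.holding] = target
        self.holding = None
        return f"placed on {target}"


Action = Tuple[str, str]  # (tool name, argument)


def scripted_policy(instruction: str, obs: Observation,
                    history: List[str]) -> Optional[Action]:
    """Stand-in for the reasoning model; the goal 'cup on shelf' is hard-coded."""
    if "cup is on shelf" in obs.text:
        return None  # goal reached, stop
    if "gripper holds cup" in obs.text:
        return ("place", "shelf")
    return ("pick", "cup")


def run_episode(world: ToyWorld, policy, instruction: str,
                max_steps: int = 10) -> List[str]:
    """Iterate: perceive, reason (policy), act (tool call), then re-perceive."""
    tools = {"pick": world.pick, "place": world.place}
    history: List[str] = []
    for _ in range(max_steps):
        obs = world.observe()                        # perception
        action = policy(instruction, obs, history)   # reasoning
        if action is None:
            break
        name, arg = action
        history.append(tools[name](arg))             # action + feedback
    return history


world = ToyWorld(locations={"cup": "table"})
log = run_episode(world, scripted_policy, "put the cup on the shelf")
```

In a real harness the policy would presumably be a language model call that receives the instruction, the latest observation, and the tool-call history; the loop structure is the part this sketch is meant to show, since re-observing after every action lets the agent react to failed or partial tool calls.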