Guava: An Effective and Universal Harness for Embodied Manipulation

Abstract

Language models trained on large-scale vision-language data have demonstrated strong potential as embodied agents. Harnessing such models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities across a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To test whether these design principles also hold for small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B-parameter open-source model using fewer than 2K trajectories collected entirely in simulation. Experiments in both simulation and real-world environments show that the distilled model performs comparably to frontier proprietary models while generalizing well to unseen objects, novel instructions, and long-horizon tasks. These results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.
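
To make the three ingredients concrete, the following is a minimal illustrative sketch of an iterative perception-reasoning-action loop over semantic actions and a multimodal observation. It is not Guava's implementation: `ToyWorld`, `Observation`, `pick`, `place`, `scripted_policy`, and `run_episode` are hypothetical names introduced here, the environment is a toy dictionary, and a scripted function stands in for the reasoning model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Observation:
    """Hypothetical multimodal observation: an image placeholder plus a text summary."""
    image: str
    scene_text: str


@dataclass
class ToyWorld:
    """Toy stand-in for a simulator: object locations and what the gripper holds."""
    locations: Dict[str, str] = field(default_factory=lambda: {"apple": "table"})
    holding: str = ""

    def observe(self) -> Observation:
        parts = [f"{obj} on {place}" for obj, place in sorted(self.locations.items())]
        if self.holding:
            parts.append(f"holding {self.holding}")
        return Observation(image="<rgb frame>", scene_text="; ".join(parts))


# Semantic action abstractions: object-level tools rather than joint commands.
def pick(world: ToyWorld, obj: str) -> str:
    if obj in world.locations and not world.holding:
        del world.locations[obj]
        world.holding = obj
        return f"picked {obj}"
    return f"failed to pick {obj}"


def place(world: ToyWorld, target: str) -> str:
    if world.holding:
        world.locations[world.holding] = target
        message = f"placed {world.holding} on {target}"
        world.holding = ""
        return message
    return "nothing to place"


TOOLS: Dict[str, Callable[[ToyWorld, str], str]] = {"pick": pick, "place": place}


def scripted_policy(instruction: str, obs: Observation, history: List[str]) -> Tuple[str, str]:
    """Scripted stand-in for the reasoning model; it ignores the instruction text."""
    if "apple on bowl" in obs.scene_text:
        return ("done", "")
    if "holding apple" in obs.scene_text:
        return ("place", "bowl")
    return ("pick", "apple")


def run_episode(world: ToyWorld, policy, instruction: str, max_steps: int = 5) -> List[str]:
    """Perceive, reason, act, and feed the tool result back, until done or out of steps."""
    history: List[str] = []
    for _ in range(max_steps):
        obs = world.observe()                          # perception
        tool, arg = policy(instruction, obs, history)  # reasoning
        if tool == "done":
            break
        feedback = TOOLS[tool](world, arg)             # action
        history.append(f"{tool}({arg}) -> {feedback}")
    return history
```

Calling `run_episode(ToyWorld(), scripted_policy, "put the apple in the bowl")` runs two tool calls before the policy reports completion. In a real harness the policy would be a language model choosing among tool calls, and the observation would carry actual camera frames; both are left abstract here.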