Guava: A Harness for Effective and Universal Embodied Manipulation

Abstract

Language models trained on large-scale vision-language data have shown strong potential as embodied agents. Harnessing these models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems: the model supplies high-level reasoning, while external modules handle perception, planning, and control. However, it remains unclear what makes a harness effective for embodied manipulation, and to what extent such a harness can unlock embodied capabilities across a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through a systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles extend even to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experiments in both simulation and real-world environments show that the resulting model performs comparably to frontier proprietary models while generalizing strongly to unseen objects, novel instructions, and long-horizon tasks. These results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.
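
As a rough illustration of the three ingredients named above, the following Python sketch shows one way an iterative perception-reasoning-action loop over semantic actions and multimodal observations could be organized. It is not Guava's implementation: the names Observation, Action, run_episode, and demo, the tools pick and place, and the scripted policy standing in for the reasoning model are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Observation:
    """Multimodal observation: an image plus a text summary (both placeholders)."""
    image: bytes
    text: str


@dataclass
class Action:
    """Semantic action: a named tool call with arguments, not raw joint commands."""
    name: str
    args: Dict[str, str] = field(default_factory=dict)


def run_episode(
    observe: Callable[[], Observation],
    decide: Callable[[Observation, List[Action]], Action],
    tools: Dict[str, Callable[..., None]],
    max_steps: int = 10,
) -> List[Action]:
    """Iterate perceive -> reason -> act until the policy emits 'done'."""
    history: List[Action] = []
    for _ in range(max_steps):
        obs = observe()                    # perception
        action = decide(obs, history)      # reasoning (a model call in practice)
        history.append(action)
        if action.name == "done":
            break
        tools[action.name](**action.args)  # action through an external module
    return history


def demo() -> List[str]:
    """Toy pick-and-place world; the policy is scripted, not a language model."""
    state = {"holding": None, "placed": False}

    def observe() -> Observation:
        text = f"holding={state['holding']} placed={state['placed']}"
        return Observation(image=b"", text=text)

    def decide(obs: Observation, history: List[Action]) -> Action:
        if "placed=True" in obs.text:
            return Action("done")
        if "holding=None" in obs.text:
            return Action("pick", {"obj": "cup"})
        return Action("place", {"target": "shelf"})

    def pick(obj: str) -> None:
        state["holding"] = obj

    def place(target: str) -> None:
        state["holding"] = None
        state["placed"] = True

    history = run_episode(observe, decide, {"pick": pick, "place": place})
    return [a.name for a in history]
```

Because the loop re-observes after every tool call, the policy can react to the outcome of each action instead of committing to a full plan up front. Swapping the scripted decide function for a call to a reasoning model would leave run_episode unchanged, which is the sense in which such an interface is model-agnostic.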