ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Abstract

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations. This hybrid action space often leaves them uncertain whether to continue with GUI actions or switch to tools, which leads to suboptimal execution paths. The difficulty stems from three sources: the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. These results suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. The project is open-sourced at https://x-plug.github.io/ToolCUA/
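The abstract does not give the form of the Tool-Efficient Path Reward, so the following is only an illustrative sketch of how a trajectory-level reward could favor successful, shorter paths and discourage failed tool calls. The names `Step` and `path_reward`, the weights, and the step budget are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One action in a trajectory (hypothetical structure, not from the paper)."""
    kind: str               # "gui" for click/type actions, "tool" for tool calls
    succeeded: bool = True  # whether the action executed without error


def path_reward(steps, task_success, max_steps=30,
                length_weight=0.2, tool_error_penalty=0.1):
    """Assumed reward shape: 0 on task failure; otherwise 1 plus a bonus for
    shorter paths, minus a penalty for each failed tool call."""
    if not task_success:
        return 0.0
    n = min(len(steps), max_steps)
    # Shorter successful paths earn a larger bonus, up to length_weight.
    length_bonus = length_weight * (1.0 - n / max_steps)
    # Failed tool calls stand in for "inappropriate" tool use in this sketch.
    failed_tools = sum(1 for s in steps if s.kind == "tool" and not s.succeeded)
    return max(1.0 + length_bonus - tool_error_penalty * failed_tools, 0.0)


# A 12-step GUI-only path versus a 3-step path that switches to a tool once.
gui_only = [Step("gui") for _ in range(12)]
mixed = [Step("gui"), Step("tool"), Step("gui")]
```

Under these assumed weights, `path_reward(mixed, True)` exceeds `path_reward(gui_only, True)`, which is the qualitative behavior the abstract describes: when both paths complete the task, the shorter path with a well-placed tool call is preferred.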
