Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Abstract

Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in the development of computer-use agents. Current benchmarks oversimplify grounding tasks as short referring expressions and fail to capture the complexity of real-world interactions, which require software commonsense, layout understanding, and fine-grained manipulation. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark of 564 finely annotated samples spanning diverse task types, including text matching, element recognition, layout understanding, and precise manipulation. We also synthesize and release Jedi, the largest computer-use grounding dataset, which contains 4 million examples produced through multi-perspective decoupling of tasks. Multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and OSWorld-G. Furthermore, we show that the improved grounding obtained with Jedi directly enhances the agentic capabilities of general foundation models on complex computer tasks, raising performance on OSWorld from 5% to 27%. Through detailed ablation studies, we identify the key factors that contribute to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. The benchmark, data, checkpoints, and code are all open-sourced and available at https://osworld-grounding.github.io.
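
As an illustration only, grounding benchmarks of this kind are often scored by checking whether a predicted click point lands inside the annotated target region. The sketch below assumes that setup. The `BBox` type and the `grounding_accuracy` function are hypothetical names introduced here, not taken from the released code, and the actual OSWorld-G evaluation protocol may differ.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class BBox:
    """Axis-aligned target region in pixel coordinates (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # A point on the border counts as inside (an assumption of this sketch).
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    predictions: Sequence[Tuple[float, float]],
    targets: Sequence[BBox],
) -> float:
    """Fraction of predicted (x, y) click points that fall inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)


if __name__ == "__main__":
    # Two made-up samples: the first click hits its target, the second misses.
    boxes = [BBox(0, 0, 10, 10), BBox(20, 20, 30, 30)]
    clicks = [(5.0, 5.0), (0.0, 0.0)]
    print(grounding_accuracy(clicks, boxes))
```

A point-in-box check like this only covers click-style actions; tasks that involve dragging or other fine-grained manipulation would need a richer success criterion.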