LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark

Abstract

Mobile GUI agents show promise for automating tasks but face generalization challenges across diverse real-world scenarios. Traditional approaches that pre-train or fine-tune on massive datasets struggle with the diversity of mobile applications and user-specific tasks. Rather than pursuing universal generalization through larger datasets, we propose enhancing mobile GUI agents with human demonstrations, focusing on improving performance in unseen scenarios. To realize this paradigm, we introduce LearnGUI, the first comprehensive dataset specifically designed for studying demonstration-based learning in mobile GUI agents, comprising 2,252 offline tasks and 101 online tasks with high-quality human demonstrations. We further develop LearnAct, a sophisticated multi-agent framework that automatically extracts knowledge from demonstrations to enhance task completion. The framework integrates three specialized agents: DemoParser for knowledge extraction, KnowSeeker for relevant knowledge retrieval, and ActExecutor for demonstration-enhanced task execution. Our experiments show significant performance gains in both offline and online evaluations. In offline evaluation, a single demonstration raises Gemini-1.5-Pro's accuracy from 19.3% to 51.7%. In online evaluation, our framework raises UI-TARS-7B-SFT's task success rate from 18.1% to 32.8%. Together, the LearnAct framework and the LearnGUI benchmark establish demonstration-based learning as a promising direction for more adaptable, personalized, and deployable mobile GUI agents.
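
As a rough illustration of how the three roles named in the abstract could fit together, the following Python sketch wires up a toy parse, retrieve, and execute flow. It is not the authors' implementation. The data classes, the function names, the word-overlap retrieval score, and the example demonstrations are all hypothetical. A real executor would also pass the assembled prompt to a model, which this sketch does not do.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """A recorded human demonstration (hypothetical structure)."""
    instruction: str
    actions: list[str]


@dataclass
class Knowledge:
    """Reusable knowledge distilled from one demonstration (hypothetical)."""
    instruction: str
    summary: str


def parse_demo(demo: Demo) -> Knowledge:
    """DemoParser role (toy): summarize a demonstration as a step chain."""
    return Knowledge(demo.instruction, " -> ".join(demo.actions))


def _tokens(text: str) -> set[str]:
    return set(text.lower().split())


def seek_knowledge(task: str, base: list[Knowledge], k: int = 1) -> list[Knowledge]:
    """KnowSeeker role (toy): rank knowledge by Jaccard word overlap with the task."""
    def score(item: Knowledge) -> float:
        a, b = _tokens(task), _tokens(item.instruction)
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    return sorted(base, key=score, reverse=True)[:k]


def build_prompt(task: str, retrieved: list[Knowledge]) -> str:
    """ActExecutor role (toy): assemble a demonstration-augmented prompt."""
    lines = [
        f"Demonstration: {item.instruction}\nSteps: {item.summary}"
        for item in retrieved
    ]
    lines.append(f"Task: {task}")
    return "\n".join(lines)


# Made-up demonstrations, used only to exercise the flow.
demos = [
    Demo("Set an alarm for 7 am",
         ["open Clock", "tap Alarm", "tap +", "set 07:00", "tap Save"]),
    Demo("Send a message to Alice",
         ["open Messages", "tap Alice", "type text", "tap Send"]),
]
knowledge_base = [parse_demo(d) for d in demos]

task = "Set an alarm for 9 am"
prompt = build_prompt(task, seek_knowledge(task, knowledge_base))
print(prompt)
```

Here the alarm demonstration is retrieved for the new alarm task because it shares the most words with it. Word overlap is only a stand-in for whatever retrieval method the framework actually uses.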
