AppAgent: Multimodal Agents as Smartphone Users

Abstract

Recent advances in large language models (LLMs) have led to intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. The framework lets the agent operate apps through a simplified action space that mimics human-like interactions such as tapping and swiping. This approach bypasses the need for system back-end access, which broadens its applicability across diverse apps. Central to the agent's functionality is its learning method: the agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent refers to when executing complex tasks across different applications. To demonstrate the agent's practicality, we conducted extensive testing on 50 tasks across 10 different applications, including social media, email, maps, shopping, and sophisticated image editing tools. The results affirm the agent's proficiency in handling a diverse array of high-level tasks.
