InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction

Abstract

This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or provide only workflow modularity, our agent integrates tool-based and pure-vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks step by step. We demonstrate its generality by evaluating it not only on pure vision-based real-world benchmarks (e.g., OSWorld), but also on more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld, higher than Claude-Computer-Use. Code and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.