InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction

Abstract

This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches, which either build intricate workflows around a single large model or provide only workflow modularity, our agent integrates tool-based and pure-vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks step by step. We demonstrate its generality by evaluating it not only on a pure vision-based real-world benchmark (OSWorld) but also on more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld, higher than Claude-Computer-Use. Code and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.