InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction

May 16, 2025
Authors: Bin Lei, Weitai Kang, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding
cs.AI

Abstract

This paper introduces InfantAgent-Next, a generalist agent that interacts with computers in a multimodal manner, covering text, images, audio, and video. Existing approaches either build intricate workflows around a single large model or offer only workflow modularity. In contrast, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks step by step. We demonstrate its generality by evaluating it not only on pure vision-based real-world benchmarks (i.e., OSWorld) but also on more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld, higher than Claude-Computer-Use. Code and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.
