InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
May 16, 2025
Authors: Bin Lei, Weitai Kang, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding
cs.AI
Abstract
This paper introduces InfantAgent-Next, a generalist agent capable
of interacting with computers in a multimodal manner, encompassing text,
images, audio, and video. Unlike existing approaches that either build
intricate workflows around a single large model or only provide workflow
modularity, our agent integrates tool-based and pure vision agents within a
highly modular architecture, enabling different models to collaboratively solve
decoupled tasks in a step-by-step manner. Its generality is demonstrated by
evaluation not only on pure vision-based real-world benchmarks (i.e.,
OSWorld), but also on more general or tool-intensive benchmarks (e.g., GAIA and
SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld,
higher than Claude-Computer-Use. Code and evaluation scripts are open-sourced
at https://github.com/bin123apple/InfantAgent.