InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
May 16, 2025
Authors: Bin Lei, Weitai Kang, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding
cs.AI
Abstract
This paper introduces InfantAgent-Next, a generalist agent capable
of interacting with computers in a multimodal manner, encompassing text,
images, audio, and video. Unlike existing approaches that either build
intricate workflows around a single large model or only provide workflow
modularity, our agent integrates tool-based and pure vision agents within a
highly modular architecture, enabling different models to collaboratively solve
decoupled tasks in a step-by-step manner. Our generality is demonstrated by our
ability to evaluate not only pure vision-based real-world benchmarks (i.e.,
OSWorld), but also more general or tool-intensive benchmarks (e.g., GAIA and
SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld,
higher than Claude-Computer-Use. Codes and evaluation scripts are open-sourced
at https://github.com/bin123apple/InfantAgent.
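The modular, step-by-step collaboration described in the abstract can be sketched roughly as follows. This is a minimal illustration only; the class and method names (`ToolAgent`, `VisionAgent`, `handles`, `step`) are hypothetical and not taken from the InfantAgent codebase:

```python
# Hypothetical sketch of routing decoupled subtasks to specialized agents.
# Names and structure are illustrative, not from the InfantAgent repository.

class ToolAgent:
    """Handles tool-intensive subtasks (e.g., shell commands, file edits)."""
    def handles(self, subtask):
        return subtask["kind"] == "tool"

    def step(self, subtask):
        return f"tool-agent executed: {subtask['goal']}"


class VisionAgent:
    """Handles pure-vision subtasks (e.g., locating GUI elements on screen)."""
    def handles(self, subtask):
        return subtask["kind"] == "vision"

    def step(self, subtask):
        return f"vision-agent executed: {subtask['goal']}"


def run(subtasks, agents):
    """Route each decoupled subtask to the first agent that can handle it,
    executing them one step at a time."""
    trace = []
    for subtask in subtasks:
        agent = next(a for a in agents if a.handles(subtask))
        trace.append(agent.step(subtask))
    return trace


trace = run(
    [{"kind": "vision", "goal": "click the Save button"},
     {"kind": "tool", "goal": "run unit tests"}],
    [ToolAgent(), VisionAgent()],
)
```

Because each agent only sees the subtasks it can handle, different underlying models can back each agent class without changes to the dispatch loop, which is one plausible reading of the "highly modular architecture" the abstract describes.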