InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction

Abstract

This paper introduces InfantAgent-Next, a generalist agent that interacts with computers multimodally, across text, images, audio, and video. Unlike existing approaches, which either build intricate workflows around a single large model or provide only workflow modularity, our agent integrates tool-based and pure-vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks step by step. We demonstrate this generality by evaluating the agent not only on a pure vision-based real-world benchmark (OSWorld) but also on more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld, higher than Claude-Computer-Use. Code and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.
