An Interactive Agent Foundation Model

Abstract

The development of artificial intelligence systems is shifting from static, task-specific models toward dynamic, agent-based systems that perform well across a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm to train AI agents across many domains, datasets, and tasks. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction, yielding a versatile and adaptable AI framework. We demonstrate the performance of our framework in three separate domains: Robotics, Gaming AI, and Healthcare. In each domain, the model generates meaningful and contextually relevant outputs. The strength of our approach lies in its generality: it draws on a variety of data sources, such as robotics sequences, gameplay data, large-scale video datasets, and textual information, for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.
