Running the Gauntlet: Re-evaluating Agent Capabilities Beyond Familiar Environments

Abstract

As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to evaluate their capabilities faithfully. However, current benchmarks are typically built on popular applications with relatively simple tasks, and they focus on a narrow set of capabilities while overlooking broader dimensions. As a result, modern agents saturate these benchmarks, which therefore fail to probe the agents' limitations. To address this, we introduce GauntletBench, a web-based benchmark for evaluating agent generalisation in challenging scenarios. It focuses on three underexplored capabilities (temporal perception, graphical understanding, and 3D reasoning) across five less-covered professional applications (Video Editor, Workflow Builder, 3D Modeller, Flight Analyser, and Circuit Designer), each with 20 vision-intensive tasks (100 in total). Our benchmark provides a modular pipeline comprising an environment compatible with both open- and closed-source agent frameworks, controlled web-based applications, a well-structured task suite, and an automated evaluation engine with diverse metrics. Contrary to widespread expectations, our empirical results reveal that frontier agentic systems remain far from human-level performance. Even the best-performing agent achieves only a 19.1% success rate on GauntletBench, highlighting limitations in these overlooked capabilities and in generalisation. By comparison, non-expert human annotators achieve over 80% success on our challenging yet feasible tasks, revealing a substantial gap between current agent capabilities and those required for complex real-world scenarios.