Passing the Test: Re-evaluating Agent Capabilities in Unfamiliar Environments

Abstract

As agentic systems evolve and are deployed widely in real-world scenarios, there is a growing need to evaluate their capabilities faithfully. However, current benchmarks are typically built on popular applications with relatively simple tasks and focus on a narrow set of capabilities while overlooking broader dimensions. As a result, modern agents saturate these benchmarks, which then fail to probe the agents' limitations. To this end, we introduce GauntletBench, a web-based benchmark for evaluating agent generalisation in challenging scenarios. It focuses on three underexplored capabilities (temporal perception, graphical understanding, and 3D reasoning) across five less-covered professional applications (Video Editor, Workflow Builder, 3D Modeller, Flight Analyser, and Circuit Designer), each with 20 vision-intensive tasks (100 in total). The benchmark provides a modular pipeline comprising an environment compatible with both open- and closed-source agent frameworks, a controlled web-based application, a well-structured task suite, and an automated evaluation engine with diverse metrics. Contrary to widespread expectations, our empirical results show that frontier agentic systems remain far from human-level performance. Even the state-of-the-art agent achieves only a 19.1% success rate on GauntletBench, highlighting its limitations in these overlooked capabilities and in generalisation. By comparison, non-expert human annotators achieve over 80% success on our challenging yet feasible tasks, revealing a substantial gap between current agent capabilities and those required for complex real-world scenarios.