

The Necessity of a Unified Framework for LLM-Based Agent Evaluation

February 3, 2026
Authors: Pengyu Zhu, Li Sun, Philip S. Yu, Sen Su
cs.AI

Abstract

With the advent of Large Language Models (LLMs), general-purpose agents have seen fundamental advances. However, evaluating these agents presents unique challenges that distinguish them from static QA benchmarks. We observe that current agent benchmarks are heavily confounded by extraneous factors, including system prompts, toolset configurations, and environmental dynamics. Existing evaluations often rely on fragmented, researcher-specific frameworks in which the prompt engineering for reasoning and tool usage varies significantly, making it difficult to attribute performance gains to the model itself. Additionally, the lack of standardized environmental data leads to untraceable errors and non-reproducible results. This absence of standardization introduces substantial unfairness and opacity into the field. We argue that a unified evaluation framework is essential for the rigorous advancement of agent evaluation. To this end, we introduce a proposal aimed at standardizing agent evaluation.
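
To make the confounders concrete, here is a minimal sketch, in plain Python, of how an evaluation harness could pin the factors the abstract names (system prompt, toolset, and environment dynamics) so that score differences are attributable to the model under test. This is not the authors' framework; the names `EvalConfig`, `run_episode`, and `agent_fn`, and the toy environment logic, are hypothetical illustrations.

```python
# Hypothetical sketch: hold every factor except the model fixed.
from dataclasses import dataclass
from typing import Callable, Dict, List
import random


@dataclass(frozen=True)
class EvalConfig:
    """Everything except the model is fixed and versioned."""
    system_prompt: str        # identical prompt for every model
    toolset: tuple            # fixed, ordered tool names
    env_seed: int             # reproducible environment dynamics
    max_turns: int = 10


def run_episode(agent_fn: Callable[[str, List[str]], str],
                task: str,
                cfg: EvalConfig) -> Dict[str, object]:
    """Run one task under a shared config; agent_fn is the only variable."""
    rng = random.Random(cfg.env_seed)        # same dynamics for every model
    transcript = [f"[system] {cfg.system_prompt}", f"[task] {task}"]
    for turn in range(cfg.max_turns):
        action = agent_fn(task, transcript)  # model-specific behaviour
        transcript.append(f"[agent] {action}")
        # Toy environment: success requires using an allowed tool, plus a
        # seeded random outcome standing in for real environment dynamics.
        if any(tool in action for tool in cfg.toolset) and rng.random() > 0.3:
            return {"success": True, "turns": turn + 1, "transcript": transcript}
    return {"success": False, "turns": cfg.max_turns, "transcript": transcript}


if __name__ == "__main__":
    cfg = EvalConfig(
        system_prompt="You are a web agent. Use only the listed tools.",
        toolset=("search", "open_page", "submit"),
        env_seed=42,
    )

    def dummy_agent(task: str, history: List[str]) -> str:
        # Placeholder for a call to the model being evaluated.
        return "search('cheapest flight to Tokyo')"

    print(run_episode(dummy_agent, "Book the cheapest flight to Tokyo.", cfg))
```

Because `EvalConfig` is frozen and seeded, two models compared under the same config see identical prompts, tools, and environment behaviour, which is the kind of controlled setup the abstract argues current benchmarks lack.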