AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios
May 22, 2025
Authors: Yunjia Qi, Hao Peng, Xiaozhi Wang, Amy Xin, Youfeng Liu, Bin Xu, Lei Hou, Juanzi Li
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated advanced capabilities in
real-world agentic applications. Growing research efforts aim to develop
LLM-based agents to address practical demands, introducing a new challenge:
agentic scenarios often involve lengthy instructions with complex constraints,
such as extended system prompts and detailed tool specifications. While
adherence to such instructions is crucial for agentic applications, whether
LLMs can reliably follow them remains underexplored. In this paper, we
introduce AgentIF, the first benchmark for systematically evaluating the
instruction-following ability of LLMs in agentic scenarios. AgentIF features three key
characteristics: (1) Realistic, constructed from 50 real-world agentic
applications. (2) Long, averaging 1,723 words with a maximum of 15,630 words.
(3) Complex, averaging 11.9 constraints per instruction, covering diverse
constraint types, such as tool specifications and condition constraints. To
construct AgentIF, we collect 707 human-annotated instructions across 50
agentic tasks from industrial application agents and open-source agentic
systems. For each instruction, we annotate the associated constraints and
corresponding evaluation metrics, including code-based evaluation, LLM-based
evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically
evaluate existing advanced LLMs. We observe that current models generally
perform poorly, especially in handling complex constraint structures and tool
specifications. We further conduct error analysis and analytical experiments on
instruction length and meta constraints, providing some findings about the
failure modes of existing LLMs. We have released the code and data to
facilitate future research.
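
The abstract notes that each annotated constraint is paired with a code-based, LLM-based, or hybrid code-LLM evaluation metric. The following is a minimal, hypothetical sketch of how such constraint-level scoring could be wired together; the `Constraint` fields, the `evaluate_response` function, and the injected `judge` callable are illustrative assumptions, not the paper's actual schema or released evaluator.

```python
# Hypothetical sketch (not the official AgentIF evaluator): shows how constraints with
# code-based, LLM-based, and hybrid code-LLM metrics could be checked per response.
from dataclasses import dataclass
from typing import Callable, Literal, Optional

# A judge is any callable mapping a prompt to a yes/no verdict; in practice it would
# wrap an LLM API call, but it is injected here so the sketch stays self-contained.
LLMJudge = Callable[[str], bool]

@dataclass
class Constraint:
    description: str                                  # natural-language constraint text
    eval_type: Literal["code", "llm", "hybrid"]       # which evaluation metric applies
    checker: Optional[Callable[[str], bool]] = None   # deterministic check (code/hybrid)
    judge_prompt: Optional[str] = None                # rubric for the LLM judge (llm/hybrid)

def evaluate_response(response: str,
                      constraints: list[Constraint],
                      judge: LLMJudge) -> float:
    """Return the fraction of constraints the response satisfies."""
    satisfied = 0
    for c in constraints:
        if c.eval_type == "code":
            ok = c.checker(response)
        elif c.eval_type == "llm":
            ok = judge(f"{c.judge_prompt}\n\nResponse:\n{response}")
        else:  # hybrid: require both the deterministic check and the LLM judge to pass
            ok = c.checker(response) and judge(f"{c.judge_prompt}\n\nResponse:\n{response}")
        satisfied += int(ok)
    return satisfied / len(constraints) if constraints else 1.0

if __name__ == "__main__":
    constraints = [
        Constraint("Reply must be under 50 words.", "code",
                   checker=lambda r: len(r.split()) < 50),
        Constraint("Reply must maintain a polite tone.", "llm",
                   judge_prompt="Does the response maintain a polite tone? Answer yes or no."),
    ]
    dummy_judge: LLMJudge = lambda prompt: True  # stand-in for a real LLM judge
    print(evaluate_response("Sure, happy to help with that.", constraints, dummy_judge))
```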