AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios
May 22, 2025
Authors: Yunjia Qi, Hao Peng, Xiaozhi Wang, Amy Xin, Youfeng Liu, Bin Xu, Lei Hou, Juanzi Li
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated advanced capabilities in
real-world agentic applications. Growing research efforts aim to develop
LLM-based agents to address practical demands, introducing a new challenge:
agentic scenarios often involve lengthy instructions with complex constraints,
such as extended system prompts and detailed tool specifications. While
adherence to such instructions is crucial for agentic applications, whether
LLMs can reliably follow them remains underexplored. In this paper, we
introduce AgentIF, the first benchmark for systematically evaluating LLM
instruction-following ability in agentic scenarios. AgentIF features three key
characteristics: (1) Realistic, constructed from 50 real-world agentic
applications. (2) Long, averaging 1,723 words with a maximum of 15,630 words.
(3) Complex, averaging 11.9 constraints per instruction, covering diverse
constraint types, such as tool specifications and condition constraints. To
construct AgentIF, we collect 707 human-annotated instructions across 50
agentic tasks from industrial application agents and open-source agentic
systems. For each instruction, we annotate the associated constraints and
corresponding evaluation metrics, including code-based evaluation, LLM-based
evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically
evaluate existing advanced LLMs. We observe that current models generally
perform poorly, especially in handling complex constraint structures and tool
specifications. We further conduct error analysis and analytical experiments on
instruction length and meta constraints, providing findings about the
failure modes of existing LLMs. We have released the code and data to
facilitate future research.
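
The abstract distinguishes code-based, LLM-based, and hybrid code-LLM evaluation of the annotated constraints. The Python sketch below illustrates one plausible way such per-constraint evaluation could be organized; the `Constraint` fields, `evaluate_constraint`, and `llm_judge` are hypothetical names chosen for illustration and are not taken from the paper's released code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative sketch only: the field names and dispatch logic below are
# assumptions, not AgentIF's released schema or evaluation pipeline.

@dataclass
class Constraint:
    description: str                 # natural-language constraint text
    constraint_type: str             # e.g. "tool_specification", "condition"
    eval_mode: str                   # "code", "llm", or "hybrid"
    code_check: Optional[Callable[[str], bool]] = None   # programmatic checker
    llm_rubric: Optional[str] = None                      # rubric for an LLM judge


def evaluate_constraint(
    response: str,
    constraint: Constraint,
    llm_judge: Callable[[str, str], bool],
) -> bool:
    """Return True if the model response satisfies the constraint."""
    if constraint.eval_mode == "code":
        return constraint.code_check(response)
    if constraint.eval_mode == "llm":
        return llm_judge(response, constraint.llm_rubric)
    # Hybrid: the programmatic check and the LLM judgment must both pass.
    return constraint.code_check(response) and llm_judge(response, constraint.llm_rubric)


# Example: a tool-specification constraint verified purely by code.
requires_named_tool_call = Constraint(
    description="Tool calls must be emitted as JSON with a 'tool_name' field.",
    constraint_type="tool_specification",
    eval_mode="code",
    code_check=lambda r: '"tool_name"' in r,
)
```

Under this framing, a hybrid constraint passes only when the programmatic check and the LLM judge agree, which mirrors the abstract's description of mixing code-based and LLM-based evaluation for constraints that neither style can verify alone.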