AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios

Abstract

Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, which introduces a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. Although adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating LLM instruction-following ability in agentic scenarios. AgentIF has three key characteristics: (1) Realistic: it is constructed from 50 real-world agentic applications. (2) Long: instructions average 1,723 words, with a maximum of 15,630 words. (3) Complex: instructions average 11.9 constraints each and cover diverse constraint types, such as tool specifications and condition constraints. To construct AgentIF, we collect 707 human-annotated instructions across 50 agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and their corresponding evaluation metrics, which include code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically evaluate existing advanced LLMs and observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, which yield findings about the failure modes of existing LLMs. We have released the code and data to facilitate future research.
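
The three evaluation modes named in the abstract (code-based, LLM-based, and hybrid code-LLM) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the released code: the `Constraint` class, the `evaluate` and `satisfied_fraction` functions, the `stub_llm` stand-in for a model call, and the example constraints are all hypothetical. In particular, the sketch reads "hybrid" as "an LLM extracts the relevant span, then code checks it", which is one plausible interpretation that the abstract itself does not confirm. The aggregate shown (fraction of satisfied constraints) is likewise only an example score.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical LLM hook: (prompt, response) -> text. A real system would call a model.
LLMFn = Callable[[str, str], str]


@dataclass
class Constraint:
    description: str
    mode: str                                      # "code", "llm", or "hybrid"
    check: Optional[Callable[[str], bool]] = None  # deterministic check (code, hybrid)
    prompt: Optional[str] = None                   # judge or extraction prompt (llm, hybrid)


def evaluate(constraint: Constraint, response: str, llm: LLMFn) -> bool:
    """Return True if the response satisfies the constraint under its mode."""
    if constraint.mode == "code":
        return constraint.check(response)
    if constraint.mode == "llm":
        # The judge is assumed to answer "yes" or "no".
        return llm(constraint.prompt, response).strip().lower() == "yes"
    if constraint.mode == "hybrid":
        # Assumed reading of "hybrid": the LLM extracts a span, code verifies it.
        extracted = llm(constraint.prompt, response)
        return constraint.check(extracted)
    raise ValueError(f"unknown mode: {constraint.mode}")


def satisfied_fraction(constraints: List[Constraint], response: str, llm: LLMFn) -> float:
    """Illustrative aggregate: share of constraints the response satisfies."""
    results = [evaluate(c, response, llm) for c in constraints]
    return sum(results) / len(results)


def stub_llm(prompt: str, response: str) -> str:
    """Offline stand-in for a model call so the sketch runs without network access."""
    if prompt.startswith("Extract"):
        return response.split("\n")[0]  # pretend the first line is the title
    return "yes" if "thank" in response.lower() else "no"


RESPONSE = "Weekly Report\nAll tasks finished. Thank you."

CONSTRAINTS = [
    Constraint("Response ends with a period", "code",
               check=lambda text: text.rstrip().endswith(".")),
    Constraint("Tone is polite", "llm",
               prompt="Is the tone polite? Answer yes or no."),
    Constraint("Title is a single word", "hybrid",
               check=lambda text: len(text.split()) <= 1,
               prompt="Extract the title of the response."),
]

print(satisfied_fraction(CONSTRAINTS, RESPONSE, stub_llm))
```

In this toy run the code-based and LLM-based constraints pass and the hybrid one fails, because the extracted title has two words. Keeping the mode on each constraint lets cheap deterministic checks run wherever possible and reserves model calls for judgments that code cannot express.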
