AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios

Abstract

Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, which introduces a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating the instruction-following ability of LLMs in agentic scenarios. AgentIF has three key characteristics: (1) Realistic: constructed from 50 real-world agentic applications. (2) Long: averaging 1,723 words, with a maximum of 15,630 words. (3) Complex: averaging 11.9 constraints per instruction and covering diverse constraint types, such as tool specifications and condition constraints. To construct AgentIF, we collect 707 human-annotated instructions across 50 agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and the corresponding evaluation metrics, which include code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically evaluate existing advanced LLMs. We observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, providing some findings about the failure modes of existing LLMs. We have released the code and data to facilitate future research.
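The three evaluation modes named in the abstract (code-based, LLM-based, and hybrid code-LLM) can be illustrated with a minimal Python sketch. This is not the authors' released implementation: every name here (`code_check`, `llm_check`, `hybrid_check`, `evaluate`, `stub_judge`) is hypothetical, the two reported scores are illustrative rather than the benchmark's official metrics, and the judge is a deterministic stub standing in for a real model call.

```python
from typing import Callable, Dict, List

# Hypothetical judge interface: takes a prompt, returns the judge model's reply.
Judge = Callable[[str], str]
# A check receives the agent's response plus a judge and returns pass/fail.
Check = Callable[[str, Judge], bool]


def code_check(predicate: Callable[[str], bool]) -> Check:
    """Code-based: a deterministic predicate on the response; the judge is unused."""
    return lambda response, judge: predicate(response)


def llm_check(question: str) -> Check:
    """LLM-based: ask the judge a yes/no question about the response."""
    def run(response: str, judge: Judge) -> bool:
        reply = judge(f"{question}\n\nResponse:\n{response}\n\nAnswer YES or NO.")
        return reply.strip().upper().startswith("YES")
    return run


def hybrid_check(extract_prompt: str, predicate: Callable[[str], bool]) -> Check:
    """Hybrid: the judge extracts the relevant span, then code verifies that span."""
    def run(response: str, judge: Judge) -> bool:
        span = judge(f"{extract_prompt}\n\nResponse:\n{response}")
        return predicate(span)
    return run


def evaluate(response: str, checks: List[Check], judge: Judge) -> Dict[str, object]:
    """Run every constraint check and summarize the outcome for one instruction."""
    if not checks:
        raise ValueError("at least one constraint check is required")
    results = [check(response, judge) for check in checks]
    return {
        "constraint_pass_rate": sum(results) / len(results),
        "all_constraints_met": all(results),
    }


def stub_judge(prompt: str) -> str:
    """Deterministic stand-in for a real LLM judge (assumption, for the demo only)."""
    if prompt.startswith("Extract"):
        text = prompt.split("Response:\n", 1)[1]
        return text.split(".")[0] + "."  # return the first sentence of the response
    return "YES"


response = "Summary: The order shipped on Monday. Thanks!"
checks = [
    code_check(lambda r: len(r.split()) <= 20),          # length limit
    llm_check("Is the tone of the response polite?"),    # semantic judgment
    hybrid_check("Extract the first sentence of the response.",
                 lambda s: s.startswith("Summary:")),    # extract, then verify
    code_check(lambda r: "refund" in r.lower()),         # required keyword (fails)
]
report = evaluate(response, checks, stub_judge)
print(report)
```

In this sketch each constraint is paired with the cheapest check that can decide it: plain code where the rule is mechanical, a judge where it is semantic, and the hybrid path where a judge is only needed to locate the text that code then verifies.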
