Instruction-Following Evaluation in Function Calling for Large Language Models

Abstract

Function calling is a core capability of large language models and is essential for AI agents. Existing benchmarks such as the Berkeley Function Calling Leaderboard (BFCL), tau^2-Bench (arXiv:2506.07982), and ACEBench (arXiv:2501.12851) evaluate argument correctness but do not test whether a model follows format instructions embedded in parameter descriptions, such as enclosing a value in double quotes or using an ISO date format. We introduce IFEval-FC, a benchmark inspired by IFEval (arXiv:2311.07911) that assesses precise instruction following in function calling. IFEval-FC encodes verifiable formats directly within JSON schema descriptions, for example by specifying that a value must not contain punctuation. It comprises 750 test cases, each consisting of a function with a format embedded in the description of one of its input parameters and a corresponding user query. Evaluation is fully algorithmic, which makes it objective, reproducible, and scalable. Our results show that even state-of-the-art proprietary models, including GPT-5 and Claude 4.1 Opus, frequently fail to follow basic formatting rules, highlighting a practical limitation for real-world agent systems. The complete codebase and data are publicly available at https://github.com/Skripkon/IFEval-FC.
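
As an illustration of the setup described above, the following is a minimal sketch of one test case and its algorithmic check. It is not taken from the IFEval-FC codebase: the tool name `search_books`, the schema layout, and the helpers `has_no_punctuation` and `check_call` are hypothetical, and the check covers only ASCII punctuation.

```python
import json
import string

# Hypothetical tool definition (not from the IFEval-FC repository).
# The format rule is stated only inside the parameter description.
TOOL_SCHEMA = {
    "name": "search_books",
    "description": "Search a catalog of books.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {
                "type": "string",
                "description": "Book title. The value must not contain punctuation.",
            }
        },
        "required": ["title"],
    },
}


def has_no_punctuation(value: str) -> bool:
    """Return True if the value contains no ASCII punctuation characters."""
    return not any(ch in string.punctuation for ch in value)


def check_call(arguments_json: str, param: str = "title") -> bool:
    """Check one model-produced argument string against the format rule.

    Assumes the model returns its arguments as a JSON object string.
    Malformed JSON, a non-object payload, or a missing or non-string
    parameter all count as failures.
    """
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(args, dict):
        return False
    value = args.get(param)
    return isinstance(value, str) and has_no_punctuation(value)
```

Under these assumptions, `check_call('{"title": "Dune Messiah"}')` passes, while `check_call('{"title": "Dune: Messiah"}')` fails because of the colon. Because the verdict depends only on a deterministic string check, no human or model judge is needed.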
