Instruction-Following Evaluation in Function Calling for Large Language Models
September 22, 2025
Author: Nikolai Skripko
cs.AI
Abstract
Function calling is a core capability of large language models, essential for AI agents. Existing benchmarks such as the Berkeley Function Calling Leaderboard (BFCL), tau^2-Bench (arXiv:2506.07982), and ACEBench (arXiv:2501.12851) evaluate argument correctness but do not test adherence to format instructions embedded in parameter descriptions, such as enclosing values in double quotes or using ISO date formats.

We introduce IFEval-FC, a benchmark inspired by IFEval (arXiv:2311.07911) that assesses precise instruction following in function calling. IFEval-FC encodes verifiable formats directly within JSON schema descriptions, for example specifying that a value must not contain punctuation. It includes 750 test cases, each consisting of a function with an embedded format for one of its input parameters and a corresponding user query. Evaluation is fully algorithmic, ensuring objectivity, reproducibility, and scalability.

Our results show that even state-of-the-art proprietary models, including GPT-5 and Claude 4.1 Opus, frequently fail to follow basic formatting rules, highlighting a practical limitation for real-world agent systems. The complete codebase and data are publicly available at https://github.com/Skripkon/IFEval-FC.
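
To illustrate the setup, the following is a minimal Python sketch of what one test case and its algorithmic check could look like. The get_weather schema, the field names (function, query, constrained_parameter), and the checker are hypothetical assumptions for illustration, not taken from the IFEval-FC repository.

    # Hypothetical sketch of an IFEval-FC-style test case: a tool schema whose
    # parameter description embeds a verifiable formatting instruction, plus a
    # user query that should trigger the call. Not taken from the repository.
    import json
    import string

    test_case = {
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": (
                            "Name of the city. The value must not contain "
                            "any punctuation characters."
                        ),
                    }
                },
                "required": ["city"],
            },
        },
        "query": "What's the weather like in St. Petersburg right now?",
        "constrained_parameter": "city",
    }

    def follows_no_punctuation_rule(value: str) -> bool:
        """Algorithmic check: True iff the value contains no punctuation."""
        return not any(ch in string.punctuation for ch in value)

    # Suppose the model under evaluation produced this function call.
    model_call = {
        "name": "get_weather",
        "arguments": json.dumps({"city": "St. Petersburg"}),
    }

    arguments = json.loads(model_call["arguments"])
    value = arguments[test_case["constrained_parameter"]]
    # "St. Petersburg" contains ".", so the embedded instruction is violated.
    print(follows_no_punctuation_rule(value))  # False

Scoring a model then reduces to running such a deterministic check over the argument it produced, which is what makes the evaluation objective, reproducible, and scalable.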