ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario

Abstract

Enhancing large language models (LLMs) with real-time APIs can help them generate more accurate and up-to-date responses. However, evaluating the function calling abilities of LLMs in real-world scenarios remains under-explored because of the complexity of data collection and evaluation. In this work, we introduce ComplexFuncBench, a benchmark for complex function calling across five real-world scenarios. Compared to existing benchmarks, ComplexFuncBench covers multi-step and constrained function calling, which requires long-parameter filling, parameter value reasoning, and handling of long contexts of up to 128k tokens. Additionally, we propose ComplexEval, an automatic framework for quantitatively evaluating complex function calling tasks. Through comprehensive experiments, we show where state-of-the-art LLMs fall short in function calling and suggest future directions for improving these capabilities. The data and code are available at https://github.com/THUDM/ComplexFuncBench.
