ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario
January 17, 2025
Authors: Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, Jie Tang
cs.AI
Abstract
Enhancing large language models (LLMs) with real-time APIs can help generate
more accurate and up-to-date responses. However, evaluating the function
calling abilities of LLMs in real-world scenarios remains under-explored due to
the complexity of data collection and evaluation. In this work, we introduce
ComplexFuncBench, a benchmark for complex function calling across five
real-world scenarios. Compared to existing benchmarks, ComplexFuncBench
encompasses multi-step and constrained function calling, which requires
long parameter filling, parameter value reasoning, and 128k-long context.
Additionally, we propose an automatic framework, ComplexEval, for
quantitatively evaluating complex function calling tasks. Through comprehensive
experiments, we demonstrate the deficiencies of state-of-the-art LLMs in
function calling and suggest future directions for optimizing these
capabilities. The data and code are available at
https://github.com/THUDM/ComplexFuncBench.
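To make the abstract's terms concrete, here is a minimal sketch of what a multi-step, constrained function-calling task looks like. All function names, schemas, and values below are invented for illustration and are not the benchmark's actual API; the point is that the second call's argument cannot be copied from the user query but must be reasoned out of the first call's output.

```python
# Hypothetical two-step tool chain in the style the abstract describes.
# Step 1 fills parameters from the user query; step 2 is a *constrained*
# call whose argument must be derived from step 1's result.

def search_flights(origin: str, dest: str, date: str) -> list[dict]:
    # Stub tool: in a real deployment this would hit a live API.
    return [
        {"id": "F1", "price": 420, "depart": "08:00"},
        {"id": "F2", "price": 310, "depart": "13:30"},
        {"id": "F3", "price": 550, "depart": "19:00"},
    ]

def book_flight(flight_id: str) -> dict:
    # Stub tool: returns a booking confirmation.
    return {"status": "booked", "flight_id": flight_id}

# Step 1: parameters filled directly from the user query
# ("fly from NYC to LA on 2025-01-20").
results = search_flights("NYC", "LA", "2025-01-20")

# Step 2: the user constraint ("the cheapest flight") forces the model
# to reason over step-1 output to choose the argument for the next
# call -- this is the "parameter value reasoning" the abstract names.
cheapest = min(results, key=lambda f: f["price"])
confirmation = book_flight(cheapest["id"])
print(confirmation)  # {'status': 'booked', 'flight_id': 'F2'}
```

An evaluator like the proposed ComplexEval would then need to check not just the final answer but each intermediate call: correct function, correct parameter values, and correct ordering across steps.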