ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario
January 17, 2025
Authors: Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, Jie Tang
cs.AI
Abstract
Enhancing large language models (LLMs) with real-time APIs can help generate
more accurate and up-to-date responses. However, evaluating the function
calling abilities of LLMs in real-world scenarios remains under-explored due to
the complexity of data collection and evaluation. In this work, we introduce
ComplexFuncBench, a benchmark for complex function calling across five
real-world scenarios. Compared to existing benchmarks, ComplexFuncBench
encompasses multi-step and constrained function calling, which requires
long parameter filling, parameter value reasoning, and 128k-long context.
Additionally, we propose an automatic framework, ComplexEval, for
quantitatively evaluating complex function calling tasks. Through comprehensive
experiments, we demonstrate the deficiencies of state-of-the-art LLMs in
function calling and suggest future directions for optimizing these
capabilities. The data and code are available at
https://github.com/THUDM/ComplexFuncBench.
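To make the abstract's terms concrete, here is a minimal sketch of what a multi-step, constrained function-calling task looks like. All function names, schemas, and values below are invented for illustration and are not the benchmark's actual API; the point is that the second call's argument cannot be copied from the user query but must be reasoned out of the first call's output.

```python
# Hypothetical two-step tool chain in the style the abstract describes.
# Step 1 fills parameters from the user query; step 2 is a *constrained*
# call whose argument must be derived from step 1's result.

def search_flights(origin: str, dest: str, date: str) -> list[dict]:
    # Stub tool: in a real deployment this would hit a live API.
    return [
        {"id": "F1", "price": 420, "depart": "08:00"},
        {"id": "F2", "price": 310, "depart": "13:30"},
        {"id": "F3", "price": 550, "depart": "19:00"},
    ]

def book_flight(flight_id: str) -> dict:
    # Stub tool: returns a booking confirmation.
    return {"status": "booked", "flight_id": flight_id}

# Step 1: parameters filled directly from the user query
# ("fly from NYC to LA on 2025-01-20").
results = search_flights("NYC", "LA", "2025-01-20")

# Step 2: the user constraint ("the cheapest flight") forces the model
# to reason over step-1 output to choose the argument for the next
# call -- this is the "parameter value reasoning" the abstract names.
cheapest = min(results, key=lambda f: f["price"])
confirmation = book_flight(cheapest["id"])
print(confirmation)  # {'status': 'booked', 'flight_id': 'F2'}
```

An evaluator like the proposed ComplexEval would then need to check not just the final answer but each intermediate call: correct function, correct parameter values, and correct ordering across steps.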