Can a Single Model Master Both Multi-turn Conversations and Tool Use? CALM: A Unified Conversational Agentic Language Model

Abstract

Large Language Models (LLMs) with API-calling capabilities have enabled the construction of effective Language Agents (LAs) and have also revolutionized the conventional task-oriented dialogue (TOD) paradigm. However, current approaches face a critical dilemma: TOD systems are often trained on a limited set of target APIs and require new data to maintain their quality when interfacing with new services, while LAs are not trained to maintain user intent over multi-turn conversations. Because both robust multi-turn dialogue management and advanced function calling are crucial for effective conversational agents, we evaluate these skills on three popular benchmarks: MultiWOZ 2.4 (TOD), BFCL V3 (LA), and API-Bank (LA). Our analyses reveal that specialized approaches excel in one domain but underperform in the other. To bridge this chasm, we introduce CALM (Conversational Agentic Language Model), a unified approach that integrates both conversational and agentic capabilities. We created CALM-IT, a carefully constructed multi-task dataset that interleaves multi-turn ReAct reasoning with complex API usage. Using CALM-IT, we train three models, CALM 8B, CALM 70B, and CALM 405B, which outperform top domain-specific models, including GPT-4o, across all three benchmarks.
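
To make the idea of interleaving multi-turn ReAct reasoning with API usage concrete, the following is a minimal sketch of what one such dialogue trace might look like. It is not the CALM-IT schema or the CALM training pipeline: the `find_restaurant` tool, the `Dialogue` and `Step` classes, and the slot names are all hypothetical, and slot values are supplied by hand where a trained model would extract them from the user's utterance.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical API; name, arguments, and return text are illustrative only.
def find_restaurant(area: str, food: str) -> str:
    return f"Found 1 {food} restaurant in the {area}: Example Bistro"


TOOLS: Dict[str, Callable[..., str]] = {"find_restaurant": find_restaurant}


@dataclass
class Step:
    role: str  # "user", "thought", "action", "observation", or "assistant"
    content: str


@dataclass
class Dialogue:
    steps: List[Step] = field(default_factory=list)
    # User intent accumulated across turns (assumed slot-value form).
    slots: Dict[str, str] = field(default_factory=dict)

    def user(self, text: str, **new_slots: str) -> None:
        # Record the utterance and merge hand-supplied slots into the state.
        self.steps.append(Step("user", text))
        self.slots.update(new_slots)

    def act(self, thought: str, tool: str, required: List[str]) -> str:
        # ReAct-style turn: thought, then either a clarifying question
        # or an action (API call) followed by its observation.
        self.steps.append(Step("thought", thought))
        missing = [k for k in required if k not in self.slots]
        if missing:
            reply = f"Could you tell me the {missing[0]}?"
        else:
            args = {k: self.slots[k] for k in required}
            self.steps.append(Step("action", f"{tool}({args})"))
            reply = TOOLS[tool](**args)
            self.steps.append(Step("observation", reply))
        self.steps.append(Step("assistant", reply))
        return reply


d = Dialogue()
d.user("I want Italian food.", food="italian")
r1 = d.act("Area is unknown; ask before calling the API.",
           "find_restaurant", ["area", "food"])
d.user("In the centre, please.", area="centre")
r2 = d.act("Both slots are filled; call the API.",
           "find_restaurant", ["area", "food"])
roles = [s.role for s in d.steps]
```

In this sketch the first turn ends with a clarifying question because the `area` slot is missing, and the second turn reuses the `food` slot from the first turn when calling the API. Carrying intent across turns and calling a function in the same trace are the two skills the abstract argues a single model needs.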