IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

Abstract

Large Language Models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi-turn dialogues, integrate domain-specific APIs, and adhere to strict policy constraints. However, evaluating these agents remains a significant challenge, as traditional methods fail to capture the complexity and variability of real-world interactions. We introduce IntellAgent, a scalable, open-source multi-agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse, synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. This innovative approach provides fine-grained diagnostics, addressing the limitations of static and manually curated benchmarks with coarse-grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic, multi-policy scenarios across varying levels of complexity, IntellAgent captures the nuanced interplay of agent capabilities and policy constraints. Unlike traditional methods, it employs a graph-based policy model to represent relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps, offering actionable insights for targeted optimization. Its modular, open-source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing challenges in bridging research and deployment. The framework is available at https://github.com/plurai-ai/intellagent
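The abstract describes the policy graph only at a high level: it represents relationships, likelihoods, and complexities of policy interactions. The following is a minimal sketch of one way such a graph could drive scenario sampling, not IntellAgent's actual implementation or API. It assumes that each policy node carries an integer complexity score, that each edge weight expresses how likely two policies are to appear in the same conversation, and that a scenario is built by a weighted random walk that stops once the summed complexity reaches a target. The class `PolicyGraph`, the function `sample_policies`, the policy names, and all weights are hypothetical.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class PolicyGraph:
    """Toy policy graph (hypothetical): nodes hold a complexity score,
    edges hold a co-occurrence weight."""
    complexity: dict[str, int] = field(default_factory=dict)
    edges: dict[str, dict[str, float]] = field(default_factory=dict)

    def add_policy(self, name: str, complexity: int) -> None:
        self.complexity[name] = complexity
        self.edges.setdefault(name, {})

    def link(self, a: str, b: str, weight: float) -> None:
        # Undirected edge: a higher weight means the two policies are
        # assumed more likely to occur in the same conversation.
        self.edges[a][b] = weight
        self.edges[b][a] = weight


def sample_policies(graph: PolicyGraph, target_complexity: int,
                    rng: random.Random) -> list[str]:
    """Weighted random walk that stops once the summed complexity of the
    visited policies reaches the target, or no unvisited neighbour remains."""
    current = rng.choice(sorted(graph.complexity))
    chosen = [current]
    total = graph.complexity[current]
    while total < target_complexity:
        # Only neighbours that are not yet part of the scenario are eligible.
        candidates = {p: w for p, w in graph.edges[current].items()
                      if p not in chosen and w > 0}
        if not candidates:
            break  # dead end: return the partial scenario
        names = list(candidates)
        weights = [candidates[n] for n in names]
        current = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(current)
        total += graph.complexity[current]
    return chosen


# Illustrative data only; these policies and numbers are made up.
graph = PolicyGraph()
graph.add_policy("verify_identity", 1)
graph.add_policy("refund_limit", 3)
graph.add_policy("escalate_to_human", 2)
graph.link("verify_identity", "refund_limit", 0.8)
graph.link("refund_limit", "escalate_to_human", 0.5)
graph.link("verify_identity", "escalate_to_human", 0.1)

scenario = sample_policies(graph, target_complexity=4, rng=random.Random(0))
print(scenario)
```

Raising `target_complexity` yields scenarios that combine more policies, which is one plausible way to obtain the "varying levels of complexity" the abstract mentions. Under this assumption, each sampled policy set would then seed event generation and a simulated user-agent dialogue.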
