τ-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

Abstract

Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use independently of each other, creating a gap in realistic, fully agentic evaluation over unstructured data in long-horizon interactions. We introduce τ-Knowledge, an extension of τ-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes. Our new domain, τ-Banking, models realistic fintech customer support workflows in which agents must navigate roughly 700 interconnected knowledge documents while executing tool-mediated account updates. Across embedding-based retrieval and terminal-based search, even frontier models with high reasoning budgets achieve only ~25.5% pass^1, with reliability degrading sharply over repeated trials. Agents struggle to retrieve the correct documents from densely interlinked knowledge bases and to reason accurately over complex internal policies. Overall, τ-Knowledge provides a realistic testbed for developing agents that integrate unstructured knowledge in human-facing deployments.
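The abstract reports pass^1 and notes that reliability degrades over repeated trials, but it does not define the metric. As a minimal sketch, assuming τ-Knowledge follows the pass^k definition used in τ-Bench (the probability that all k independent trials of a task succeed, averaged over tasks), the estimate could be computed as below. The function name `pass_hat_k` and the sample counts are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Estimate pass^k under the assumed tau-Bench-style definition.

    successes_per_task: number of successful trials for each task
    num_trials: trials run per task (the same for every task)
    k: how many trials must all succeed
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = 0.0
    for c in successes_per_task:
        # Fraction of size-k subsets of the trials in which every trial passed.
        total += comb(c, k) / comb(num_trials, k)
    return total / len(successes_per_task)


# Hypothetical results: 4 tasks, 4 trials each (illustrative numbers only).
example_successes = [4, 2, 0, 1]
example_scores = {k: pass_hat_k(example_successes, 4, k) for k in range(1, 5)}
```

For k = 1 this reduces to the mean per-trial success rate. For larger k the estimate can only stay the same or fall, which is one way to quantify how reliability degrades when the same task is attempted repeatedly.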
