

KARL: Knowledge Agents via Reinforcement Learning

March 5, 2026
Authors: Jonathan D. Chang, Andrew Drozdov, Shubham Toshniwal, Owen Oertell, Alexander Trott, Jacob Portes, Abhay Gupta, Pallavi Koppol, Ashutosh Baheti, Sean Kulinski, Ivan Zhou, Irene Dea, Krista Opsahl-Ong, Simon Favreau-Lessard, Sean Owen, Jose Javier Gonzalez Ortiz, Arnav Singhvi, Xabi Andrade, Cindy Wang, Kartik Sreenivasan, Sam Havens, Jialu Liu, Peyton DeNiro, Wen Sun, Michael Bendersky, Jonathan Frankle
cs.AI

Abstract

We present a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. Our work makes four core contributions. First, we introduce KARLBench, a multi-capability evaluation suite spanning six distinct search regimes, including constraint-driven entity search, cross-document report synthesis, tabular numerical reasoning, exhaustive entity retrieval, procedural reasoning over technical documentation, and fact aggregation over internal enterprise notes. Second, we show that models trained across heterogeneous search behaviors generalize substantially better than those optimized for any single benchmark. Third, we develop an agentic synthesis pipeline that employs long-horizon reasoning and tool use to generate diverse, grounded, and high-quality training data, with iterative bootstrapping from increasingly capable models. Fourth, we propose a new post-training paradigm based on iterative large-batch off-policy RL that is sample efficient, robust to train-inference engine discrepancies, and naturally extends to multi-task training with out-of-distribution generalization. Compared to Claude 4.6 and GPT 5.2, KARL is Pareto-optimal on KARLBench across cost-quality and latency-quality trade-offs, including tasks that were out-of-distribution during training. With sufficient test-time compute, it surpasses the strongest closed models. These results show that tailored synthetic data in combination with multi-task reinforcement learning enables cost-efficient and high-performing knowledge agents for grounded reasoning.
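The iterative large-batch off-policy RL paradigm described above can be illustrated with a minimal sketch: freeze a behavior policy, collect one large batch of rollouts, then take several importance-weighted gradient steps on that batch before re-collecting. This toy example uses a bandit with hypothetical per-action rewards; the paper's actual objective, batch sizes, and off-policy correction are not specified in the abstract, so everything below (the REINFORCE-style update, learning rate, and reward values) is an illustrative assumption, not the authors' method.

```python
import math
import random

random.seed(0)

# Hypothetical per-action rewards for a toy 4-armed bandit.
REWARDS = [0.1, 0.2, 0.9, 0.3]
N = len(REWARDS)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

logits = [0.0] * N
for _ in range(5):                       # outer iterations: re-collect data
    behavior = softmax(logits)           # freeze the behavior policy
    # One large batch sampled from the frozen behavior policy.
    batch = random.choices(range(N), weights=behavior, k=2048)
    for _ in range(10):                  # several off-policy gradient epochs
        cur = softmax(logits)
        baseline = sum(p * r for p, r in zip(cur, REWARDS))
        grad = [0.0] * N
        total = 0.0
        for a in batch:
            # Importance weight corrects for the policy drifting off the
            # batch's behavior distribution during the inner epochs.
            w = (cur[a] / behavior[a]) * (REWARDS[a] - baseline)
            grad[a] += w
            total += w
        # Gradient of log-softmax: one-hot(a) minus current probabilities.
        for k in range(N):
            logits[k] += 0.05 * (grad[k] - total * cur[k]) / len(batch)

best = max(range(N), key=lambda k: softmax(logits)[k])
print(best)  # the highest-reward action, 2
```

The key structural point is the two nested loops: the outer loop amortizes expensive rollout collection over many gradient steps (sample efficiency), while the importance weights keep the inner updates valid as the policy moves away from the batch's behavior distribution, which also buys tolerance to train-inference engine discrepancies.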