
KARL: Knowledge Agents via Reinforcement Learning

March 5, 2026
Authors: Jonathan D. Chang, Andrew Drozdov, Shubham Toshniwal, Owen Oertell, Alexander Trott, Jacob Portes, Abhay Gupta, Pallavi Koppol, Ashutosh Baheti, Sean Kulinski, Ivan Zhou, Irene Dea, Krista Opsahl-Ong, Simon Favreau-Lessard, Sean Owen, Jose Javier Gonzalez Ortiz, Arnav Singhvi, Xabi Andrade, Cindy Wang, Kartik Sreenivasan, Sam Havens, Jialu Liu, Peyton DeNiro, Wen Sun, Michael Bendersky, Jonathan Frankle
cs.AI

Abstract

We present a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. Our work makes four core contributions. First, we introduce KARLBench, a multi-capability evaluation suite spanning six distinct search regimes: constraint-driven entity search, cross-document report synthesis, tabular numerical reasoning, exhaustive entity retrieval, procedural reasoning over technical documentation, and fact aggregation over internal enterprise notes. Second, we show that models trained across heterogeneous search behaviors generalize substantially better than those optimized for any single benchmark. Third, we develop an agentic synthesis pipeline that employs long-horizon reasoning and tool use to generate diverse, grounded, and high-quality training data, with iterative bootstrapping from increasingly capable models. Fourth, we propose a new post-training paradigm based on iterative large-batch off-policy RL that is sample efficient, robust to train-inference engine discrepancies, and naturally extends to multi-task training with out-of-distribution generalization. Compared to Claude 4.6 and GPT 5.2, KARL is Pareto-optimal on KARLBench across cost-quality and latency-quality trade-offs, including tasks that were out-of-distribution during training. With sufficient test-time compute, it surpasses the strongest closed models. These results show that tailored synthetic data in combination with multi-task reinforcement learning enables cost-efficient and high-performing knowledge agents for grounded reasoning.
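The abstract does not include the training algorithm itself; as an illustrative sketch only, the iterative large-batch off-policy loop described in the fourth contribution (collect a large batch of rollouts, then reuse that fixed batch for several importance-weighted policy updates before collecting again) can be approximated on a toy bandit problem. All names, hyperparameters, and the environment below are hypothetical stand-ins, not the paper's implementation:

```python
import math
import random

random.seed(0)

# Hypothetical stand-in environment: a 3-armed bandit whose rewards
# play the role of task scores in the real system.
TRUE_REWARD = [0.2, 0.5, 0.9]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def collect_batch(logits, batch_size):
    """Sample a large batch of (action, reward, behavior_prob) rollouts."""
    probs = softmax(logits)
    batch = []
    for _ in range(batch_size):
        a = random.choices(range(3), weights=probs)[0]
        r = TRUE_REWARD[a] + random.gauss(0, 0.05)  # noisy task score
        batch.append((a, r, probs[a]))
    return batch

def off_policy_update(logits, batch, lr=0.5, clip=0.2, epochs=4):
    """Several clipped importance-weighted policy-gradient epochs on one
    fixed batch: this reuse of stale data is what makes the loop off-policy."""
    baseline = sum(r for _, r, _ in batch) / len(batch)
    for _ in range(epochs):
        probs = softmax(logits)
        grads = [0.0, 0.0, 0.0]
        for a, r, behavior_p in batch:
            ratio = probs[a] / behavior_p
            ratio = max(1 - clip, min(1 + clip, ratio))  # clip importance weight
            adv = r - baseline
            for i in range(3):
                indicator = 1.0 if i == a else 0.0
                # grad of log softmax prob of action a w.r.t. logit i
                grads[i] += ratio * adv * (indicator - probs[i])
        logits = [l + lr * g / len(batch) for l, g in zip(logits, grads)]
    return logits

logits = [0.0, 0.0, 0.0]
for _ in range(10):  # outer loop: big batch, then multiple off-policy epochs
    batch = collect_batch(logits, batch_size=512)
    logits = off_policy_update(logits, batch)

best = max(range(3), key=lambda i: softmax(logits)[i])
print("best arm:", best)
```

Clipping the importance weight bounds how far each reused batch can push the policy, which is one common way such loops stay stable when updates drift from the data-collecting policy.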