Repurposing Synthetic Data for Fine-grained Search Agent Supervision
October 28, 2025
Authors: Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang
cs.AI
Abstract
LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples (those with substantially correct reasoning but a flawed final answer) from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.
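
The dense, entity-aware reward described in the abstract can be illustrated with a minimal sketch. This is not the paper's exact formulation: the function name, the `partial_weight` scaling, and the assumption that entity matching is a simple set intersection are hypothetical placeholders for illustration only.

```python
def entity_aware_reward(answer_correct: bool,
                        entities_found: set[str],
                        gold_entities: set[str],
                        partial_weight: float = 0.5) -> float:
    """Sketch of an entity-aware reward in the spirit of E-GRPO.

    A correct final answer receives the full outcome reward; an incorrect
    rollout receives a partial reward proportional to the fraction of
    ground-truth entities surfaced during reasoning (the entity match rate).
    `partial_weight` is a hypothetical scaling factor, not from the paper.
    """
    if answer_correct:
        return 1.0
    if not gold_entities:
        return 0.0
    match_rate = len(entities_found & gold_entities) / len(gold_entities)
    return partial_weight * match_rate
```

Under this sketch, a rollout that reaches the wrong answer but identifies most of the ground-truth entities from the synthetic-data annotations still contributes a graded learning signal, rather than the zero reward it would receive under outcome-only GRPO.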