

Repurposing Synthetic Data for Fine-grained Search Agent Supervision

October 28, 2025
作者: Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang
cs.AI

Abstract

LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples (those with substantially correct reasoning but a flawed final answer) from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.
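As a rough illustration of the dense reward described in the abstract, the sketch below shows how an entity-aware reward and GRPO-style group-relative advantages might be computed. This is a minimal sketch under assumptions: the helper names (entity_match_rate, entity_aware_reward), the scaling factor alpha, and the normalization details are hypothetical and not taken from the paper.

```python
# Minimal sketch of an entity-aware reward in the spirit of E-GRPO.
# All names and the exact reward shape are assumptions, not the paper's formulation.
from typing import List


def entity_match_rate(reasoning_entities: List[str], gold_entities: List[str]) -> float:
    """Fraction of ground-truth entities that appear in the agent's reasoning trace."""
    if not gold_entities:
        return 0.0
    found = {e.lower() for e in reasoning_entities}
    return sum(1 for e in gold_entities if e.lower() in found) / len(gold_entities)


def entity_aware_reward(answer_correct: bool,
                        reasoning_entities: List[str],
                        gold_entities: List[str],
                        alpha: float = 0.5) -> float:
    """Correct answers receive the full reward; incorrect ones receive a partial
    reward proportional to their entity match rate (alpha is an assumed scale)."""
    if answer_correct:
        return 1.0
    return alpha * entity_match_rate(reasoning_entities, gold_entities)


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]


# Example: two failures with different entity coverage now receive different signal,
# so a "near-miss" is no longer indistinguishable from a complete failure.
gold = ["Marie Curie", "Sorbonne", "radium"]
rewards = [
    entity_aware_reward(True, gold, gold),                      # correct answer
    entity_aware_reward(False, ["Marie Curie", "radium"], gold),  # near-miss
    entity_aware_reward(False, [], gold),                         # complete failure
]
print(group_relative_advantages(rewards))
```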