FineVerify: Scaling Test-Time Compute for Agentic Search via Fine-Grained Self-Verification

Abstract

Agentic search requires language model agents to explore many sources and answer complex information-seeking questions. Scaling test-time compute is a promising way to improve these agents, but current approaches can fail because correct answers are often sparse and score-based selection depends on model calibration. We propose FineVerify, a fine-grained self-verification framework that decomposes each question into checkable sub-questions, verifies each sampled candidate answer against every sub-question, and selects the candidate with the highest aggregated score. This per-check structure reduces selection to simpler local judgments and produces scores under the same explicit criteria for all candidates. Across four agentic search benchmarks and two models, FineVerify consistently outperforms standard scaling baselines. With only four sampled trajectories, it improves GPT-5-mini by 8.2 accuracy points and Gemini-3-flash by 5.6% on average. With 12 samples, FineVerify enables GPT-5-mini to surpass the frontier model GPT-5 on BrowseComp-Plus. Beyond accuracy, FineVerify produces interpretable verification traces that help audit benchmark errors, suggesting broader applications for inspecting agentic search systems. Code and data are available at https://github.com/XuZhao0/fineverify
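
The selection step described in the abstract (score each sampled candidate against every sub-question, aggregate, and keep the top candidate) can be sketched roughly as follows. This is a minimal illustration and not the released implementation: the function names `select_candidate` and `toy_verify` are hypothetical, summation is assumed as the aggregation rule, and the keyword-matching verifier is a stand-in for what would presumably be a language model judgment.

```python
from typing import Callable, List, Sequence, Tuple


def select_candidate(
    candidates: Sequence[str],
    sub_questions: Sequence[str],
    verify: Callable[[str, str], float],
) -> Tuple[str, List[float]]:
    """Return the candidate with the highest aggregated score, plus all scores.

    Assumption: per-check scores are aggregated by summation. The abstract
    only states that scores are aggregated, not how.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    # Every candidate is scored against the same list of sub-questions,
    # so all scores are produced under identical explicit criteria.
    scores = [
        sum(verify(candidate, question) for question in sub_questions)
        for candidate in candidates
    ]
    # Ties are broken in favour of the earliest candidate.
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index], scores


def toy_verify(candidate: str, sub_question: str) -> float:
    """Hypothetical stand-in verifier: 1.0 if the check keyword appears.

    A real system would likely ask a language model whether the candidate
    satisfies the sub-question; keyword matching is used here only so the
    sketch runs without external services.
    """
    return 1.0 if sub_question.lower() in candidate.lower() else 0.0


# Toy data: three sampled answers and two checkable constraints.
sampled = ["Lyon 1889", "Paris 1889", "Paris 1900"]
checks = ["paris", "1889"]
best, all_scores = select_candidate(sampled, checks, toy_verify)
print(best, all_scores)
```

Because each candidate is judged on the same explicit checks, the per-check results also serve as a readable trace of why one answer was preferred over another.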