FineVerify: Scaling Test-Time Compute for Agentic Search with Fine-Grained Self-Verification

Abstract

Agentic search requires language model agents to explore many sources and answer complex information-seeking questions. Scaling test-time compute is a promising way to improve these agents, but current approaches can fail because correct answers are often sparse and score-based selection depends on model calibration. We propose FineVerify, a fine-grained self-verification framework that decomposes each question into checkable sub-questions, verifies sampled candidates against each sub-question, and selects the candidate with the highest aggregated score. This per-check structure turns selection into simpler local judgments and produces scores under the same explicit criteria. Across four agentic search benchmarks and two models, FineVerify consistently outperforms standard scaling baselines. With only four sampled trajectories, it improves GPT-5-mini by 8.2 accuracy points and Gemini-3-flash by 5.6% on average. With 12 samples, FineVerify enables GPT-5-mini to surpass frontier GPT-5 on BrowseComp-Plus. Beyond accuracy, FineVerify produces interpretable verification traces that help audit benchmark errors, suggesting broader applications for inspecting agentic search systems. Code and data are available at https://github.com/XuZhao0/fineverify
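The decompose, verify, and select steps described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the FineVerify repository: the names select_candidate and toy_verify, the verifier signature, and the use of a plain mean as the aggregation are all assumptions made for illustration. In a real system the verifier would presumably be a language model call that judges one candidate against one sub-question.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical verifier signature (an assumption, not the paper's API):
# verify(sub_question, candidate) -> score in [0, 1].
Verifier = Callable[[str, str], float]


def select_candidate(
    sub_questions: Sequence[str],
    candidates: Sequence[str],
    verify: Verifier,
) -> Tuple[str, List[float]]:
    """Score every candidate on every sub-question and pick the best one.

    Aggregation here is the mean of per-check scores, which is an
    assumption; the abstract only says the scores are "aggregated".
    Ties are broken in favour of the earliest candidate.
    """
    if not sub_questions:
        raise ValueError("need at least one sub-question")
    if not candidates:
        raise ValueError("need at least one candidate")

    aggregated: List[float] = []
    for candidate in candidates:
        # One local judgment per (sub-question, candidate) pair.
        checks = [verify(sq, candidate) for sq in sub_questions]
        aggregated.append(sum(checks) / len(checks))

    best_index = max(range(len(candidates)), key=aggregated.__getitem__)
    return candidates[best_index], aggregated


def toy_verify(sub_question: str, candidate: str) -> float:
    """Stand-in verifier backed by a lookup table instead of a model."""
    satisfied = {"A": {"q1"}, "B": {"q1", "q2"}}
    return 1.0 if sub_question in satisfied.get(candidate, set()) else 0.0


if __name__ == "__main__":
    best, scores = select_candidate(["q1", "q2"], ["A", "B", "A"], toy_verify)
    print(best, scores)
```

Because every candidate is scored against the same list of sub-questions, the per-check scores are directly comparable across candidates, and the list of individual checks can be kept as a verification trace for later inspection.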