

Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification

February 3, 2025
Authors: Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi
cs.AI

Abstract

Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one -- typically by verifying each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation that uses only random sampling and direct self-verification results in sustained performance improvements that, for example, elevate the Gemini v1.5 Pro model's reasoning capabilities past that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts -- chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities and introduce a benchmark to measure progress on these deficiencies.
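To make the paradigm concrete, below is a minimal sketch of sampling-based search with self-verification, written against hypothetical `generate_fn` and `verify_fn` hooks standing in for language-model calls; it is an illustration of the general recipe described in the abstract, not the authors' released implementation.

```python
# Minimal sketch of sampling-based search: sample many candidate responses,
# score each with a verifier, and return the best-verified one.
# `generate_fn` and `verify_fn` are assumed hooks, not part of the paper's code.
import random
from typing import Callable, List, Tuple


def sampling_based_search(
    question: str,
    generate_fn: Callable[[str], str],       # samples one candidate response
    verify_fn: Callable[[str, str], float],  # scores a response's correctness in [0, 1]
    num_samples: int = 16,
) -> Tuple[str, float]:
    """Generate `num_samples` candidates and return the one the verifier scores highest."""
    candidates: List[str] = [generate_fn(question) for _ in range(num_samples)]
    scored = [(resp, verify_fn(question, resp)) for resp in candidates]
    # Select the best-verified response; ties are broken arbitrarily by max().
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without a model: the "model" guesses an
    # answer and the "verifier" checks it against a known ground truth.
    truth = "42"
    toy_generate = lambda q: random.choice(["41", "42", "43"])
    toy_verify = lambda q, r: 1.0 if r == truth else 0.0
    best, score = sampling_based_search("What is 6 * 7?", toy_generate, toy_verify)
    print(f"best response: {best!r} (verifier score: {score})")
```

In this framing, test-time compute scales along two axes: the number of sampled candidates (`num_samples`) and the compute spent in `verify_fn`; the paper's implicit-scaling observation is that enlarging the candidate pool also tends to make verification itself more accurate.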
