

Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification

February 3, 2025
Authors: Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi
cs.AI

Abstract

Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one -- typically by verifying each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation that uses only random sampling and direct self-verification results in sustained performance improvements that, for example, elevate the Gemini v1.5 Pro model's reasoning capabilities past that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts -- chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities and introduce a benchmark to measure progress on these deficiencies.
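To make the paradigm concrete, below is a minimal sketch of sampling-based search with self-verification, written against hypothetical `generate_fn` and `verify_fn` hooks standing in for language-model calls; it is an illustration of the general recipe described in the abstract, not the authors' released implementation.

```python
# Minimal sketch of sampling-based search: sample many candidate responses,
# score each with a verifier, and return the best-verified one.
# `generate_fn` and `verify_fn` are assumed hooks, not part of the paper's code.
import random
from typing import Callable, List, Tuple


def sampling_based_search(
    question: str,
    generate_fn: Callable[[str], str],       # samples one candidate response
    verify_fn: Callable[[str, str], float],  # scores a response's correctness in [0, 1]
    num_samples: int = 16,
) -> Tuple[str, float]:
    """Generate `num_samples` candidates and return the one the verifier scores highest."""
    candidates: List[str] = [generate_fn(question) for _ in range(num_samples)]
    scored = [(resp, verify_fn(question, resp)) for resp in candidates]
    # Select the best-verified response; ties are broken arbitrarily by max().
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without a model: the "model" guesses an
    # answer and the "verifier" checks it against a known ground truth.
    truth = "42"
    toy_generate = lambda q: random.choice(["41", "42", "43"])
    toy_verify = lambda q, r: 1.0 if r == truth else 0.0
    best, score = sampling_based_search("What is 6 * 7?", toy_generate, toy_verify)
    print(f"best response: {best!r} (verifier score: {score})")
```

In this framing, test-time compute scales along two axes: the number of sampled candidates (`num_samples`) and the compute spent in `verify_fn`; the paper's implicit-scaling observation is that enlarging the candidate pool also tends to make verification itself more accurate.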
