Reward Models Enable Scalable Code Verification by Trading Accuracy for Throughput
June 11, 2025
Authors: Gabriel Orlanski, Nicholas Roberts, Aws Albarghouthi, Frederic Sala
cs.AI
Abstract
The standard paradigm for solving coding tasks via large language models
(LLMs) is to generate-then-rank programs, where the latter step uses a verifier
in the ranking process. The growing consensus is that a comprehensive verifier
(e.g., a full test suite) should be prioritized over an outcome reward model
(ORM) whenever possible, with little consideration given to the trade-offs
involved. We aim to challenge this assumption by systematically exploring the
trade-off between speed and accuracy. We find that ORMs play a crucial role in
scaling verification through trading accuracy for speed, even when a
comprehensive verifier is available. Their value becomes especially apparent
when used in a generate-prune-then-rank approach, where a faster but less
accurate verifier removes incorrect solutions prior to ranking -- leading to a
system that is 11.65x faster while only being 8.33% less accurate than the full
test suite. We analyze the generate-prune-then-rank approach and show that it
works by filtering out incorrect but highly ranked solutions. These findings
enable the design of scalable and accurate program ranking systems.
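As a rough illustration of the generate-prune-then-rank approach described above, the sketch below scores all candidate programs with a fast but approximate verifier (an ORM), keeps only the top-scoring survivors, and runs the expensive full test suite on those survivors alone. This is a minimal sketch under assumed interfaces: the names `generate_prune_then_rank`, `orm_score`, and `passes_full_suite`, and the top-k cutoff, are illustrative assumptions rather than the paper's implementation.

```python
from typing import Callable, List


def generate_prune_then_rank(
    candidates: List[str],
    orm_score: Callable[[str], float],        # fast, approximate verifier (ORM); assumed interface
    passes_full_suite: Callable[[str], bool],  # slow, comprehensive verifier; assumed interface
    keep_top_k: int = 10,                      # illustrative pruning cutoff
) -> List[str]:
    """Prune candidate programs with a cheap verifier, then rank the
    survivors with the expensive full test suite.

    The ORM scores every candidate quickly; only the top-k survivors are
    executed against the full test suite, so the costly verifier runs far
    fewer times than in a rank-everything baseline.
    """
    # 1. Prune: score all candidates with the cheap verifier, keep the best k.
    scored = sorted(candidates, key=orm_score, reverse=True)
    survivors = scored[:keep_top_k]

    # 2. Rank: run the comprehensive verifier only on the survivors.
    #    Programs that pass the full suite rank ahead of those that do not,
    #    with ties broken by the ORM score.
    return sorted(
        survivors,
        key=lambda program: (passes_full_suite(program), orm_score(program)),
        reverse=True,
    )
```

The speed gain comes from step 1: the pruning stage filters out incorrect but highly ranked solutions cheaply, so the comprehensive verifier is only paid for on a small shortlist.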