Reward Models Enable Scalable Code Verification by Trading Accuracy for Throughput
June 11, 2025
Authors: Gabriel Orlanski, Nicholas Roberts, Aws Albarghouthi, Frederic Sala
cs.AI
Abstract
The standard paradigm for solving coding tasks via large language models
(LLMs) is to generate-then-rank programs, where the latter step uses a verifier
in the ranking process. The growing consensus is that a comprehensive verifier
(e.g., a full test suite) should be prioritized over an outcome reward model
(ORM) whenever possible, with little consideration given to the trade-offs
involved. We aim to challenge this assumption by systematically exploring the
trade-off between speed and accuracy. We find that ORMs play a crucial role in
scaling verification through trading accuracy for speed, even when a
comprehensive verifier is available. Their value becomes especially apparent
when used in a generate-prune-then-rank approach, where a faster but less
accurate verifier removes incorrect solutions prior to ranking -- leading to a
system that is 11.65x faster while only being 8.33% less accurate than the full
test suite. We analyze the generate-prune-then-rank approach and show that it
works by filtering out incorrect but highly ranked solutions. These findings
enable the design of scalable and accurate program ranking systems.
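As a rough illustration of the generate-prune-then-rank approach described above, the sketch below scores all candidate programs with a fast but approximate verifier (an ORM), keeps only the top-scoring survivors, and runs the expensive full test suite on those survivors alone. This is a minimal sketch under assumed interfaces: the names `generate_prune_then_rank`, `orm_score`, and `passes_full_suite`, and the top-k cutoff, are illustrative assumptions rather than the paper's implementation.

```python
from typing import Callable, List


def generate_prune_then_rank(
    candidates: List[str],
    orm_score: Callable[[str], float],        # fast, approximate verifier (ORM); assumed interface
    passes_full_suite: Callable[[str], bool],  # slow, comprehensive verifier; assumed interface
    keep_top_k: int = 10,                      # illustrative pruning cutoff
) -> List[str]:
    """Prune candidate programs with a cheap verifier, then rank the
    survivors with the expensive full test suite.

    The ORM scores every candidate quickly; only the top-k survivors are
    executed against the full test suite, so the costly verifier runs far
    fewer times than in a rank-everything baseline.
    """
    # 1. Prune: score all candidates with the cheap verifier, keep the best k.
    scored = sorted(candidates, key=orm_score, reverse=True)
    survivors = scored[:keep_top_k]

    # 2. Rank: run the comprehensive verifier only on the survivors.
    #    Programs that pass the full suite rank ahead of those that do not,
    #    with ties broken by the ORM score.
    return sorted(
        survivors,
        key=lambda program: (passes_full_suite(program), orm_score(program)),
        reverse=True,
    )
```

The speed gain comes from step 1: the pruning stage filters out incorrect but highly ranked solutions cheaply, so the comprehensive verifier is only paid for on a small shortlist.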