Reward Models Enable Scalable Code Verification by Trading Accuracy for Throughput
June 11, 2025
Authors: Gabriel Orlanski, Nicholas Roberts, Aws Albarghouthi, Frederic Sala
cs.AI
Abstract
The standard paradigm for solving coding tasks via large language models
(LLMs) is to generate-then-rank programs, where the ranking step uses a
verifier. The growing consensus is that a comprehensive verifier
(e.g., a full test suite) should be prioritized over an outcome reward model
(ORM) whenever possible, with little consideration given to the trade-offs
involved. We aim to challenge this assumption by systematically exploring the
trade-off between speed and accuracy. We find that ORMs play a crucial role in
scaling verification through trading accuracy for speed, even when a
comprehensive verifier is available. Their value becomes especially apparent
when used in a generate-prune-then-rank approach, where a faster but less
accurate verifier removes incorrect solutions prior to ranking -- leading to a
system that is 11.65x faster while only being 8.33% less accurate than the full
test suite. We analyze the generate-prune-then-rank approach and show that it
works by filtering out incorrect but highly ranked solutions. These findings
enable the design of scalable and accurate program ranking systems.
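To make the generate-prune-then-rank idea concrete, below is a minimal sketch of such a pipeline. The abstract does not specify an interface, so the callables generate_candidates, orm_score, and run_full_test_suite, as well as the sampling and pruning parameters, are illustrative assumptions rather than the authors' implementation.

from typing import Callable, List

def generate_prune_then_rank(
    prompt: str,
    generate_candidates: Callable[[str, int], List[str]],   # LLM sampler (assumed interface)
    orm_score: Callable[[str, str], float],                  # fast, less accurate verifier (assumed interface)
    run_full_test_suite: Callable[[str, str], float],        # slow, comprehensive verifier (assumed interface)
    n_samples: int = 50,
    keep_fraction: float = 0.2,
) -> str:
    # Step 1: sample candidate programs from the LLM.
    candidates = generate_candidates(prompt, n_samples)

    # Step 2: prune with the fast but less accurate ORM, keeping only the
    # top-scoring fraction of candidates.
    scored = sorted(candidates, key=lambda c: orm_score(prompt, c), reverse=True)
    survivors = scored[: max(1, int(len(scored) * keep_fraction))]

    # Step 3: rank the survivors with the comprehensive verifier (e.g., the
    # full test suite), which now runs on far fewer programs.
    return max(survivors, key=lambda c: run_full_test_suite(prompt, c))

The speed gain in this sketch comes from the pruning step: the expensive verifier only runs on the surviving fraction of candidates, while the cheap ORM pass removes incorrect solutions that would otherwise have to be executed against the full test suite.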