Reward Models Enable Scalable Code Verification by Trading Accuracy for Throughput

Abstract

The standard paradigm for solving coding tasks with large language models (LLMs) is generate-then-rank: the model generates candidate programs, and a verifier is then used to rank them. The growing consensus is that a comprehensive verifier (e.g., a full test suite) should be preferred over an outcome reward model (ORM) whenever possible, with little consideration given to the trade-offs involved. We challenge this assumption by systematically exploring the trade-off between speed and accuracy. We find that ORMs play a crucial role in scaling verification by trading accuracy for speed, even when a comprehensive verifier is available. Their value is especially apparent in a generate-prune-then-rank approach, where a faster but less accurate verifier removes incorrect solutions prior to ranking, yielding a system that is 11.65x faster while being only 8.33% less accurate than the full test suite. We analyze the generate-prune-then-rank approach and show that it works by filtering out incorrect but highly ranked solutions. These findings enable the design of scalable and accurate program ranking systems.
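As a rough illustration of the generate-prune-then-rank idea, the following minimal Python sketch drops candidates that fail a cheap check and ranks the survivors by a score. It is not the authors' implementation: the function names, the use of a syntax check as the weak verifier, the toy scores standing in for a reward model, and the fallback when every candidate is pruned are all assumptions made for this example.

```python
from typing import Callable, List, Sequence


def generate_prune_then_rank(
    candidates: Sequence[str],
    cheap_check: Callable[[str], bool],
    score: Callable[[str], float],
) -> List[str]:
    """Drop candidates that fail a cheap check, then rank the rest by score."""
    survivors = [c for c in candidates if cheap_check(c)]
    # Assumed design choice: if the cheap check rejects everything,
    # fall back to ranking the full candidate set.
    pool = survivors if survivors else list(candidates)
    return sorted(pool, key=score, reverse=True)


def parses(source: str) -> bool:
    """Hypothetical weak verifier: accept only code that parses as Python."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


# Hypothetical stand-in for scores produced by a learned reward model.
toy_scores = {
    "def add(a, b): return a + b": 0.7,
    "def add(a, b) return a + b": 0.9,  # scored highly, but does not parse
    "def add(a, b): return a - b": 0.2,
}

ranked = generate_prune_then_rank(
    list(toy_scores), parses, lambda c: toy_scores[c]
)
```

In this toy run, the candidate with the highest score is removed before ranking because it fails the syntax check, which mirrors the abstract's point that pruning filters out incorrect but highly ranked solutions.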
