Reward Models Enable Scalable Code Verification by Trading Accuracy for Throughput

Abstract

The standard paradigm for solving coding tasks via large language models (LLMs) is to generate-then-rank programs, where the latter step uses a verifier in the ranking process. The growing consensus is that a comprehensive verifier (e.g., a full test suite) should be prioritized over an outcome reward model (ORM) whenever possible, with little consideration given to the trade-offs involved. We aim to challenge this assumption by systematically exploring the trade-off between speed and accuracy. We find that ORMs play a crucial role in scaling verification by trading accuracy for speed, even when a comprehensive verifier is available. Their value becomes especially apparent in a generate-prune-then-rank approach, where a faster but less accurate verifier removes incorrect solutions prior to ranking, yielding a system that is 11.65x faster while being only 8.33% less accurate than the full test suite. We analyze the generate-prune-then-rank approach and show that it works by filtering out incorrect but highly ranked solutions. These findings enable the design of scalable and accurate program ranking systems.
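The generate-prune-then-rank idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which concrete verifier fills which role, so `fast_check` and `rank_score` are hypothetical callables. The syntax check used as the pruner and the hand-written scores are toy stand-ins chosen only to show how pruning can remove an incorrect candidate that the ranker would otherwise place first.

```python
from typing import Callable, List, Tuple


def generate_prune_then_rank(
    candidates: List[str],
    fast_check: Callable[[str], bool],
    rank_score: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Prune candidates with a cheap check, then rank only the survivors."""
    # Prune step: drop every candidate the fast (possibly imperfect) check rejects.
    survivors = [c for c in candidates if fast_check(c)]
    # Rank step: score the survivors and sort from highest to lowest score.
    scored = [(c, rank_score(c)) for c in survivors]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def compiles(src: str) -> bool:
    """Toy pruner (assumption): accept a candidate only if it parses as Python."""
    try:
        compile(src, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


# Hypothetical candidate programs, as if sampled from an LLM.
candidates = [
    "def add(a, b): return a + b",
    "def add(a, b) return a + b",  # syntax error: missing colon
    "def add(a, b):\n    return a - b",
]

# Made-up ranker scores for illustration only; the broken candidate is
# deliberately given the highest score.
toy_scores = {candidates[0]: 0.8, candidates[1]: 0.9, candidates[2]: 0.4}

ranked = generate_prune_then_rank(candidates, compiles, toy_scores.__getitem__)
for program, score in ranked:
    print(f"{score:.1f}  {program!r}")
```

Without the prune step, the syntactically invalid candidate would be ranked first by the toy scores. With it, the ranker never sees that candidate, and it is also called on fewer programs, which is where the speed-up in such a pipeline would come from.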
