Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks

Abstract

Inference-time scaling has been critical to the success of recent models such as OpenAI o1 and DeepSeek R1. However, many techniques used to train models for inference-time scaling require tasks whose answers can be verified, which limits their application to domains such as math, coding, and logical reasoning. We take inspiration from how humans make first attempts, ask others for detailed feedback, and make improvements based on that feedback across a wide spectrum of open-ended endeavors. To this end, we collect data for and train dedicated Feedback and Edit Models that are capable of performing inference-time scaling for open-ended general-domain tasks. In our setup, one model generates an initial response, a second model gives feedback on that response, and a third model uses the feedback to edit the response. We show that performance on Arena Hard, a benchmark strongly predictive of Chatbot Arena Elo, can be boosted by scaling the number of initial response drafts, effective feedback, and edited responses. When scaled optimally, our setup based on 70B models from the Llama 3 family reaches SoTA performance on Arena Hard at 92.7 as of 5 Mar 2025, surpassing OpenAI o1-preview-2024-09-12 with 90.4 and DeepSeek R1 with 92.3.
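The three-model setup described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the callable signatures, the three scaling parameters, and especially the `score` callable used to pick a final answer, since the abstract does not say how a final response is chosen among candidates. It is not the authors' implementation.

```python
from typing import Callable, List

# Hypothetical interfaces; the abstract does not specify any of these.
Generate = Callable[[str], str]            # prompt -> initial draft
Feedback = Callable[[str, str], str]       # (prompt, response) -> feedback text
Edit = Callable[[str, str, str], str]      # (prompt, response, feedback) -> edited response
Score = Callable[[str, str], float]        # (prompt, response) -> score (assumed selector)


def expand_candidates(
    prompt: str,
    generate: Generate,
    give_feedback: Feedback,
    edit: Edit,
    n_drafts: int = 2,
    n_feedback: int = 2,
    n_edits: int = 1,
) -> List[str]:
    """Build n_drafts * n_feedback * n_edits edited candidate responses."""
    candidates: List[str] = []
    for _ in range(n_drafts):
        draft = generate(prompt)                    # model 1: initial response
        for _ in range(n_feedback):
            fb = give_feedback(prompt, draft)       # model 2: feedback on the draft
            for _ in range(n_edits):
                candidates.append(edit(prompt, draft, fb))  # model 3: edit using feedback
    return candidates


def select_best(prompt: str, candidates: List[str], score: Score) -> str:
    """Return the highest-scoring candidate under the assumed scoring function."""
    return max(candidates, key=lambda response: score(prompt, response))
```

In this sketch, each of the three loop counts is a separate scaling axis, corresponding to the number of initial drafts, feedback passes, and edited responses mentioned in the abstract. In practice the callables would wrap calls to the respective models.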
