s1: Simple test-time scaling

Abstract

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset, s1K, of 1,000 questions paired with reasoning traces, selected using three criteria that we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute, either by forcefully terminating the model's thinking process or by lengthening it, appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning of the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.
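The budget forcing idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation (that lives in the linked repository); it only mirrors the two behaviors the abstract names: cutting thinking off at a token budget, and replacing an attempted stop with "Wait". The delimiter `END_OF_THINKING`, the function `budget_force`, its parameters, and the toy model `toy_step` are all hypothetical names introduced for illustration.

```python
from typing import Callable, List

# Hypothetical end-of-thinking delimiter; the real delimiter depends on the model.
END_OF_THINKING = "<end_think>"


def budget_force(
    step: Callable[[List[str]], str],
    prompt: List[str],
    max_thinking_tokens: int,
    num_waits: int = 0,
) -> List[str]:
    """Generate thinking tokens under a budget (illustrative sketch).

    step: hypothetical next-token function mapping the context to one token.
    max_thinking_tokens: hard cap; thinking is terminated once it is reached.
    num_waits: how many times an attempted stop is replaced by "Wait".
    """
    thinking: List[str] = []
    waits_left = num_waits
    while len(thinking) < max_thinking_tokens:
        token = step(prompt + thinking)
        if token == END_OF_THINKING:
            if waits_left == 0:
                break  # the model is allowed to stop thinking
            waits_left -= 1
            token = "Wait"  # suppress the stop and nudge the model to continue
        thinking.append(token)
    # Close the thinking phase, whether the model stopped or the cap was hit.
    thinking.append(END_OF_THINKING)
    return thinking


def toy_step(context: List[str]) -> str:
    """Toy stand-in for a model: tries to stop after two "step" tokens."""
    last_wait = max((i for i, t in enumerate(context) if t == "Wait"), default=-1)
    recent = context[last_wait + 1:]
    return END_OF_THINKING if recent.count("step") >= 2 else "step"


if __name__ == "__main__":
    print(budget_force(toy_step, ["Q"], max_thinking_tokens=10))
    print(budget_force(toy_step, ["Q"], max_thinking_tokens=10, num_waits=1))
    print(budget_force(toy_step, ["Q"], max_thinking_tokens=1))
```

With `num_waits=1` the toy model's first attempt to stop is replaced by "Wait", so it thinks for a second round; with `max_thinking_tokens=1` thinking is cut short regardless of what the model would have done next.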
