s1: Simple test-time scaling
January 31, 2025
Authors: Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto
cs.AI
Abstract
Test-time scaling is a promising new approach to language modeling that uses
extra test-time compute to improve performance. Recently, OpenAI's o1 model
showed this capability but did not publicly share its methodology, leading to
many replication efforts. We seek the simplest approach to achieve test-time
scaling and strong reasoning performance. First, we curate a small dataset s1K
of 1,000 questions paired with reasoning traces relying on three criteria we
validate through ablations: difficulty, diversity, and quality. Second, we
develop budget forcing to control test-time compute by forcefully terminating
the model's thinking process or lengthening it by appending "Wait" multiple
times to the model's generation when it tries to end. This can lead the model
to double-check its answer, often fixing incorrect reasoning steps. After
supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and
equipping it with budget forcing, our model s1 exceeds o1-preview on
competition math questions by up to 27% (MATH and AIME24). Further, scaling s1
with budget forcing allows extrapolating beyond its performance without
test-time intervention: from 50% to 57% on AIME24. Our model, data, and code
are open-source at https://github.com/simplescaling/s1.Summary