s1: Simple test-time scaling
January 31, 2025
Authors: Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto
cs.AI
Abstract
Test-time scaling is a promising new approach to language modeling that uses
extra test-time compute to improve performance. Recently, OpenAI's o1 model
showed this capability but did not publicly share its methodology, leading to
many replication efforts. We seek the simplest approach to achieve test-time
scaling and strong reasoning performance. First, we curate a small dataset s1K
of 1,000 questions paired with reasoning traces relying on three criteria we
validate through ablations: difficulty, diversity, and quality. Second, we
develop budget forcing to control test-time compute by forcefully terminating
the model's thinking process or lengthening it by appending "Wait" multiple
times to the model's generation when it tries to end. This can lead the model
to double-check its answer, often fixing incorrect reasoning steps. After
supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and
equipping it with budget forcing, our model s1 exceeds o1-preview on
competition math questions by up to 27% (MATH and AIME24). Further, scaling s1
with budget forcing allows extrapolating beyond its performance without
test-time intervention: from 50% to 57% on AIME24. Our model, data, and code
are open-source at https://github.com/simplescaling/s1.Summary