Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking

Abstract

Recent advances in large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated the effectiveness of test-time scaling, where extended reasoning processes substantially enhance model performance. Even so, current models remain limited in how well they handle long texts and in the efficiency of reinforcement learning (RL) training. To address these issues, we propose a simple yet effective test-time scaling approach, Multi-round Thinking. This method iteratively refines model reasoning by using the answer from the previous round as part of the prompt for the next round. Extensive experiments across multiple models, including QwQ-32B and DeepSeek-R1, consistently show performance improvements on benchmarks such as AIME 2024, MATH-500, GPQA-diamond, and LiveCodeBench. For instance, the accuracy of QwQ-32B on the AIME 2024 dataset improved from 80.3% (Round 1) to 82.1% (Round 2), while DeepSeek-R1 showed a similar increase from 79.7% to 82.0%. These results confirm that Multi-round Thinking is a broadly applicable, straightforward approach to achieving stable improvements in model performance, underscoring its potential for future developments in test-time scaling techniques.

The key prompt is:

{Original question prompt} The assistant's previous answer is: <answer> {last round answer} </answer>, and please re-answer.
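The Multi-round Thinking loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `build_prompt`, `multi_round_thinking`, and the `generate` callable (a stand-in for whatever model call is used) are hypothetical, and only the prompt wording is taken from the key prompt in the abstract. What counts as the "answer" passed to the next round is left to `generate`.

```python
from typing import Callable, List, Optional

# Wording taken from the key prompt quoted in the abstract.
RETHINK_TEMPLATE = (
    "{question} The assistant's previous answer is: "
    "<answer> {previous_answer} </answer>, and please re-answer."
)


def build_prompt(question: str, previous_answer: Optional[str]) -> str:
    """Return the prompt for one round.

    Round 1 has no previous answer, so the original question is used as-is.
    Later rounds append the previous round's answer via the template.
    """
    if previous_answer is None:
        return question
    return RETHINK_TEMPLATE.format(
        question=question, previous_answer=previous_answer
    )


def multi_round_thinking(
    question: str,
    generate: Callable[[str], str],  # hypothetical model call: prompt -> answer
    rounds: int = 2,
) -> List[str]:
    """Run `rounds` rounds and return the answer produced in each round."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")

    answers: List[str] = []
    previous: Optional[str] = None
    for _ in range(rounds):
        prompt = build_prompt(question, previous)
        previous = generate(prompt)  # only this answer is carried forward
        answers.append(previous)
    return answers
```

Calling `multi_round_thinking(question, generate, rounds=2)` with any function that maps a prompt string to an answer string returns the Round 1 and Round 2 answers in order; the last element is the final answer. Returning every round's answer, rather than only the last one, makes it easy to compare accuracy per round, as the abstract does.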

