Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking
March 25, 2025
Authors: Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yunjie Ji, Yiping Peng, Han Zhao, Xiangang Li
cs.AI
Abstract
Recent advances in large language models (LLMs), such as OpenAI-o1 and
DeepSeek-R1, have demonstrated the effectiveness of test-time scaling, where
extended reasoning processes substantially enhance model performance. Despite
this, current models are constrained by limitations in handling long texts and
reinforcement learning (RL) training efficiency. To address these issues, we
propose a simple yet effective test-time scaling approach, Multi-round Thinking.
This method iteratively refines model reasoning by leveraging previous answers
as prompts for subsequent rounds. Extensive experiments across multiple models,
including QwQ-32B and DeepSeek-R1, consistently show performance improvements
on various benchmarks such as AIME 2024, MATH-500, GPQA-diamond, and
LiveCodeBench. For instance, the accuracy of QwQ-32B improved from 80.3% (Round
1) to 82.1% (Round 2) on the AIME 2024 dataset, while DeepSeek-R1 showed a
similar increase from 79.7% to 82.0%. These results confirm that Multi-round
Thinking is a broadly applicable, straightforward approach to achieving stable
enhancements in model performance, underscoring its potential for future
developments in test-time scaling techniques.

The key prompt used for each subsequent round is: {Original question prompt} The assistant's previous answer is: <answer> {last round answer} </answer>, and please re-answer.
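The iterative procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `generate` is a hypothetical stand-in for any LLM completion call (e.g. QwQ-32B or DeepSeek-R1 behind an inference API), stubbed here so the sketch runs on its own. Only the prompt template is taken from the paper.

```python
def generate(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call; returns a dummy answer.
    In practice this would query a model such as QwQ-32B or DeepSeek-R1."""
    return f"answer to: {prompt[:30]}"


def multi_round_thinking(question: str, rounds: int = 2) -> str:
    """Iteratively re-ask the question, feeding the previous round's answer
    back through the prompt template quoted in the abstract."""
    answer = generate(question)  # Round 1: the original question prompt only
    for _ in range(rounds - 1):
        # Rounds 2..N: wrap the previous answer in the paper's template.
        prompt = (
            f"{question} The assistant's previous answer is: "
            f"<answer> {answer} </answer>, and please re-answer."
        )
        answer = generate(prompt)
    return answer
```

Note that each round discards the previous round's chain of thought and keeps only the final answer, so the context passed to the model stays short even as the number of rounds grows.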