Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking

Abstract

Recent advances in large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated the effectiveness of test-time scaling, where extended reasoning processes substantially enhance model performance. Despite this, current models are constrained by limitations in handling long texts and in reinforcement learning (RL) training efficiency. To address these issues, we propose a simple yet effective test-time scaling approach, Multi-round Thinking. This method iteratively refines model reasoning by using the previous round's answer as part of the prompt for the next round. Extensive experiments across multiple models, including QwQ-32B and DeepSeek-R1, show consistent performance improvements on benchmarks such as AIME 2024, MATH-500, GPQA-diamond, and LiveCodeBench. For instance, the accuracy of QwQ-32B improved from 80.3% (Round 1) to 82.1% (Round 2) on the AIME 2024 dataset, while DeepSeek-R1 showed a similar increase from 79.7% to 82.0%. These results confirm that Multi-round Thinking is a broadly applicable, straightforward approach to achieving stable improvements in model performance, underscoring its potential for future developments in test-time scaling techniques.

The key prompt: {Original question prompt} The assistant's previous answer is: <answer> {last round answer} </answer>, and please re-answer.
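The re-prompting loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` is a hypothetical callable standing in for whatever model call is used (prompt string in, final answer string out), the function and variable names are ours, and the newline between the question and the rest of the template is an assumption, since the abstract shows the prompt on a single line.

```python
from typing import Callable, List

# Template follows the "key prompt" quoted in the abstract.
# Assumption: a newline separates the question from the rest of the prompt.
RETHINK_TEMPLATE = (
    "{question}\n"
    "The assistant's previous answer is: <answer> {previous_answer} </answer>, "
    "and please re-answer."
)


def multi_round_think(
    question: str,
    generate: Callable[[str], str],  # hypothetical model call: prompt -> answer
    rounds: int = 2,
) -> List[str]:
    """Return the answer from every round (index 0 is Round 1)."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")

    # Round 1 uses the original question prompt only.
    answers = [generate(question)]

    # Each later round sees the question plus only the previous round's answer.
    for _ in range(rounds - 1):
        prompt = RETHINK_TEMPLATE.format(
            question=question, previous_answer=answers[-1]
        )
        answers.append(generate(prompt))
    return answers
```

With `rounds=2` this corresponds to the Round 1 and Round 2 setting reported in the abstract. Because each round's prompt carries only the previous answer, the prompt length stays roughly constant as the number of rounds grows.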
