Learning from Failures in Multi-Attempt Reinforcement Learning

Abstract

Recent advances in reinforcement learning (RL) for large language models (LLMs), exemplified by DeepSeek R1, have shown that even a simple question-answering task can substantially improve an LLM's reasoning capabilities. In this work, we extend this approach by modifying the task into a multi-attempt setting. Instead of generating a single response per question, the model is given multiple attempts, and feedback is provided after each incorrect response. The multi-attempt task encourages the model to refine its previous attempts and to search more efficiently. Experimental results show that even a small LLM trained on the multi-attempt task achieves significantly higher accuracy when evaluated with more attempts, improving from 45.6% with 1 attempt to 52.5% with 2 attempts on the math benchmark, a gain of 6.9 percentage points. In contrast, the same LLM trained on a standard single-turn task improves only marginally when given more attempts during evaluation, from 42.3% to 43.2%, a gain of 0.9 percentage points. These results indicate that, compared with the standard single-turn task, an LLM trained on the multi-attempt task achieves slightly better performance on math benchmarks and also learns to refine its responses more effectively based on user feedback. Full code is available at https://github.com/DualityRL/multi-attempt.
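The multi-attempt setting described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `run_multi_attempt_episode`, the exact-match correctness check, the wording of the feedback message, and the `toy_policy` stand-in are all assumptions made for illustration, and the reward design used for training is not specified in the abstract and is therefore left out.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, text); a hypothetical dialogue format


def run_multi_attempt_episode(
    question: str,
    answer: str,
    policy: Callable[[List[Message]], str],
    max_attempts: int = 2,
) -> Tuple[bool, int, List[Message]]:
    """Give the policy up to `max_attempts` tries, with feedback after each miss.

    Returns (solved, attempts_used, dialogue). Exact string match stands in
    for whatever answer checker a real setup would use (an assumption).
    """
    dialogue: List[Message] = [("user", question)]
    for attempt in range(1, max_attempts + 1):
        response = policy(dialogue)
        dialogue.append(("assistant", response))
        if response.strip() == answer.strip():
            return True, attempt, dialogue
        if attempt < max_attempts:
            remaining = max_attempts - attempt
            # Feedback wording is illustrative, not taken from the paper.
            dialogue.append(
                ("user", f"Your answer is incorrect. You have {remaining} attempt(s) left. Try again.")
            )
    return False, max_attempts, dialogue


def toy_policy(dialogue: List[Message]) -> str:
    """Hypothetical stand-in for an LLM: wrong at first, right after feedback."""
    feedback_seen = sum(1 for role, _ in dialogue if role == "user") - 1
    return "5" if feedback_seen == 0 else "4"


solved, attempts_used, dialogue = run_multi_attempt_episode(
    "What is 2 + 2?", "4", toy_policy, max_attempts=2
)
print(solved, attempts_used)
```

With `max_attempts=1` the same loop reduces to the standard single-turn task, which is the baseline the abstract compares against.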
