A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning

Abstract

Multi-turn problem solving is critical yet challenging for Large Reasoning Models (LRMs): it requires them to reflect on their reasoning and revise it in response to feedback. Existing Reinforcement Learning (RL) methods train LRMs in a single-turn paradigm with verifiable rewards. However, we observe that models trained this way often lose the ability to solve problems across multiple turns and struggle to revise answers based on contextual feedback, which leads to repetitive responses. We ask: can LRMs learn to reflect on their answers in a multi-turn context? In this work, we find that training models with multi-turn RL using only unary feedback (e.g., "Let's try again") after wrong answers can improve both single-turn performance and multi-turn reasoning. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving and can be easily applied to existing single-turn RL training setups. Experimental results show that RL training with UFO preserves single-turn performance and improves multi-turn reasoning accuracy by up to 14%, enabling language models to react better to feedback in multi-turn problem solving. To further minimize the number of turns needed to reach a correct answer while encouraging diverse reasoning when mistakes occur, we design reward structures that guide models to produce careful and deliberate answers in each turn. Code: https://github.com/lichengliu03/unary-feedback
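
To make the interaction pattern concrete, the following is a minimal sketch of a single multi-turn episode with unary feedback. It is an illustration under stated assumptions, not the authors' implementation: the names `run_episode`, `policy`, `is_correct`, `max_turns`, and `decay` are hypothetical, and the exponential turn-decay reward is only one plausible way to favor reaching a correct answer in fewer turns. The abstract does not specify the actual reward structure.

```python
from typing import Callable, List, Tuple

# Unary feedback: the same message after every wrong answer, with no hint
# about what went wrong.
FEEDBACK = "Let's try again."


def run_episode(
    policy: Callable[[str], str],
    question: str,
    is_correct: Callable[[str], bool],
    max_turns: int = 5,
    decay: float = 0.5,
) -> Tuple[List[str], float]:
    """Roll out one multi-turn episode and return (answers, reward).

    All parameters are hypothetical stand-ins: `policy` maps the current
    context to an answer, and `is_correct` plays the role of a verifier.
    """
    context = question
    answers: List[str] = []
    for turn in range(max_turns):
        answer = policy(context)
        answers.append(answer)
        if is_correct(answer):
            # Assumed shaping: earlier successes earn a larger reward.
            return answers, decay ** turn
        # The failed attempt plus the unary feedback becomes the next observation.
        context = f"{context}\n{answer}\n{FEEDBACK}"
    # No correct answer within the turn budget.
    return answers, 0.0
```

A toy policy that answers with the number of feedback messages it has seen so far can be passed as `policy` to check the loop: with a verifier that accepts only `"2"`, the episode ends on the third turn.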
