A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning

Abstract

Multi-turn problem solving is critical yet challenging for Large Reasoning Models (LRMs), which must reflect on their reasoning and revise their answers in response to feedback. Existing Reinforcement Learning (RL) methods train LRMs in a single-turn paradigm with verifiable rewards. However, we observe that models trained with existing RL paradigms often lose their ability to solve problems across multiple turns and struggle to revise answers based on contextual feedback, producing repetitive responses instead. We ask: can LRMs learn to reflect on their answers in a multi-turn context? In this work, we find that training models with multi-turn RL using only unary feedback (e.g., "Let's try again") after wrong answers can improve both single-turn performance and multi-turn reasoning. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving and can be easily applied to existing single-turn RL training setups. Experimental results show that RL training with UFO preserves single-turn performance and improves multi-turn reasoning accuracy by up to 14%, enabling language models to react better to feedback in multi-turn problem solving. To further minimize the number of turns needed for a correct answer while encouraging diverse reasoning when mistakes occur, we design reward structures that guide models to produce careful and deliberate answers in each turn. Code: https://github.com/lichengliu03/unary-feedback
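The multi-turn setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`rollout`, `episode_reward`, `toy_policy`), the turn limit, and the geometric decay factor `gamma` are all assumptions made for illustration. The only detail taken from the abstract is that a wrong answer is followed by a single unary feedback message such as "Let's try again", and that the reward should favor solving the problem in fewer turns.

```python
from typing import Callable, List, Optional, Tuple

# Unary feedback message; the wording follows the example in the abstract.
FEEDBACK = "Let's try again."


def rollout(
    question: str,
    answer_fn: Callable[[List[str]], str],
    is_correct: Callable[[str], bool],
    max_turns: int = 5,  # assumed turn limit, not from the abstract
) -> Tuple[List[str], Optional[int]]:
    """Run one multi-turn episode.

    Returns the dialogue history and the 1-based turn at which the
    answer was correct, or None if no turn succeeded.
    """
    history = [question]
    for turn in range(1, max_turns + 1):
        answer = answer_fn(history)
        history.append(answer)
        if is_correct(answer):
            return history, turn
        # The only observation after a wrong answer is the unary feedback.
        history.append(FEEDBACK)
    return history, None


def episode_reward(solved_turn: Optional[int], gamma: float = 0.5) -> float:
    """Hypothetical reward: 1.0 for a first-turn solve, multiplied by
    gamma for each extra turn, and 0.0 if the episode never succeeds."""
    if solved_turn is None:
        return 0.0
    return gamma ** (solved_turn - 1)


def toy_policy(history: List[str]) -> str:
    """Stand-in for a model: proposes the first candidate not yet tried."""
    tried = {h for h in history[1:] if h != FEEDBACK}
    for candidate in ["3", "4", "5"]:
        if candidate not in tried:
            return candidate
    return "5"


if __name__ == "__main__":
    history, solved = rollout("What is 2 + 2?", toy_policy, lambda a: a == "4")
    print(solved, episode_reward(solved))  # prints: 2 0.5
```

In this toy run the policy answers "3", receives the unary feedback, then answers "4" on the second turn. Because the feedback carries no information about what went wrong, the policy can only improve by avoiding its own earlier answers, which is the kind of behavior a decaying reward is meant to encourage.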
