Learning from Failures in Multi-Attempt Reinforcement Learning

Abstract

Recent advances in reinforcement learning (RL) for large language models (LLMs), exemplified by DeepSeek R1, have shown that even a simple question-answering task can substantially improve an LLM's reasoning capabilities. In this work, we extend this approach by modifying the task into a multi-attempt setting. Instead of generating a single response per question, the model is given multiple attempts, with feedback provided after each incorrect response. The multi-attempt task encourages the model to refine its previous attempts and improve search efficiency. Experimental results show that even a small LLM trained on the multi-attempt task achieves significantly higher accuracy when evaluated with more attempts, improving from 45.6% with 1 attempt to 52.5% with 2 attempts on the math benchmark. In contrast, the same LLM trained on a standard single-turn task shows only a marginal improvement, from 42.3% to 43.2%, when given more attempts during evaluation. The results indicate that, compared to the standard single-turn task, an LLM trained on the multi-attempt task achieves slightly better performance on math benchmarks while also learning to refine its responses more effectively based on user feedback. Full code is available at https://github.com/DualityRL/multi-attempt
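To make the multi-attempt setting concrete, the following is a minimal sketch of a single episode. It is not taken from the released code: the function names (`run_multi_attempt_episode`, `make_scripted_policy`), the exact-match correctness check, the wording of the feedback message, and the 0/1 reward are all assumptions chosen for illustration.

```python
from typing import Callable, List, Tuple

# A dialogue is a list of (role, text) pairs.
Dialogue = List[Tuple[str, str]]

# Hypothetical feedback text; the actual prompt wording is an assumption.
FEEDBACK = "Your answer is incorrect. Please try again."


def run_multi_attempt_episode(
    policy: Callable[[Dialogue], str],
    question: str,
    reference: str,
    max_attempts: int = 2,
) -> Tuple[Dialogue, float, int]:
    """Roll out one episode with up to `max_attempts` tries.

    Returns the dialogue, a reward (assumed here to be 1.0 if any
    attempt matches the reference and 0.0 otherwise), and the number
    of attempts used.
    """
    dialogue: Dialogue = [("user", question)]
    for attempt in range(1, max_attempts + 1):
        response = policy(dialogue)
        dialogue.append(("assistant", response))
        # Assumed correctness check: exact string match after stripping.
        if response.strip() == reference.strip():
            return dialogue, 1.0, attempt
        # Feedback is only given if another attempt remains.
        if attempt < max_attempts:
            dialogue.append(("user", FEEDBACK))
    return dialogue, 0.0, max_attempts


def make_scripted_policy(answers: List[str]) -> Callable[[Dialogue], str]:
    """Stand-in for an LLM: returns the given answers in order."""
    remaining = iter(answers)
    return lambda dialogue: next(remaining)


if __name__ == "__main__":
    policy = make_scripted_policy(["41", "42"])
    dialogue, reward, used = run_multi_attempt_episode(
        policy, "What is 6 * 7?", "42", max_attempts=2
    )
    for role, text in dialogue:
        print(f"{role}: {text}")
    print("reward:", reward, "attempts used:", used)
```

Setting `max_attempts=1` recovers the standard single-turn task, so the same rollout function could in principle serve both training setups compared above.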
