Learning from Failures in Multi-Attempt Reinforcement Learning
March 4, 2025
Authors: Stephen Chung, Wenyu Du, Jie Fu
cs.AI
Abstract
Recent advancements in reinforcement learning (RL) for large language models
(LLMs), exemplified by DeepSeek R1, have shown that even a simple
question-answering task can substantially improve an LLM's reasoning
capabilities. In this work, we extend this approach by modifying the task into
a multi-attempt setting. Instead of generating a single response per question,
the model is given multiple attempts, with feedback provided after incorrect
responses. The multi-attempt task encourages the model to refine its previous
attempts and improve search efficiency. Experimental results show that even a
small LLM trained on a multi-attempt task achieves significantly higher
accuracy when evaluated with more attempts, improving from 45.6% with 1 attempt
to 52.5% with 2 attempts on the math benchmark. In contrast, the same LLM
trained on a standard single-turn task exhibits only a marginal improvement,
increasing from 42.3% to 43.2% when given more attempts during evaluation. The
results indicate that, compared to the standard single-turn task, an LLM
trained on a multi-attempt task achieves slightly better performance on math
benchmarks while also learning to refine its responses more effectively based
on user feedback. Full code is available at
https://github.com/DualityRL/multi-attempt
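To make the multi-attempt setting described above concrete, the sketch below shows one possible rollout loop: the model answers a question, receives textual feedback after an incorrect response, and earns a reward only if it produces a correct answer within the attempt budget. This is a minimal illustration under assumed names (`multi_attempt_rollout`, `is_correct`, `model.generate`) and an assumed binary reward; it is not the authors' implementation, which is available in the linked repository.

```python
# Minimal sketch of a multi-attempt rollout, assuming a chat-style model
# object with a .generate(messages) method and a binary terminal reward.
# All names and the exact feedback/reward scheme are illustrative, not
# taken from the paper's code.

FEEDBACK = (
    "Your previous answer is incorrect. "
    "Please reflect on the mistake and try again."
)


def is_correct(response: str, reference_answer: str) -> bool:
    """Toy correctness check: normalized exact match.
    A real setup would likely use a math-answer verifier instead."""
    return response.strip() == reference_answer.strip()


def multi_attempt_rollout(model, question: str, reference_answer: str,
                          max_attempts: int = 2):
    """Roll out up to `max_attempts` answers for one question,
    appending feedback after each incorrect response.
    Returns the full dialogue and a scalar reward for RL training."""
    dialogue = [{"role": "user", "content": question}]
    for attempt in range(max_attempts):
        response = model.generate(dialogue)  # hypothetical generation call
        dialogue.append({"role": "assistant", "content": response})
        if is_correct(response, reference_answer):
            return dialogue, 1.0  # correct within the attempt budget
        if attempt < max_attempts - 1:
            # Give feedback and let the model try again.
            dialogue.append({"role": "user", "content": FEEDBACK})
    return dialogue, 0.0  # no correct answer produced
```

During training, the number of attempts per question can be varied, and at evaluation time the same prompt format lets one compare 1-attempt versus 2-attempt accuracy as reported in the abstract.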