Learning from Failures in Multi-Attempt Reinforcement Learning
March 4, 2025
Authors: Stephen Chung, Wenyu Du, Jie Fu
cs.AI
Abstract
Recent advancements in reinforcement learning (RL) for large language models
(LLMs), exemplified by DeepSeek R1, have shown that even a simple
question-answering task can substantially improve an LLM's reasoning
capabilities. In this work, we extend this approach by modifying the task into
a multi-attempt setting. Instead of generating a single response per question,
the model is given multiple attempts, with feedback provided after incorrect
responses. The multi-attempt task encourages the model to refine its previous
attempts and improve search efficiency. Experimental results show that even a
small LLM trained on a multi-attempt task achieves significantly higher
accuracy when evaluated with more attempts, improving from 45.6% with 1 attempt
to 52.5% with 2 attempts on the math benchmark. In contrast, the same LLM
trained on a standard single-turn task exhibits only a marginal improvement,
increasing from 42.3% to 43.2% when given more attempts during evaluation. The
results indicate that, compared to the standard single-turn task, an LLM
trained on a multi-attempt task achieves slightly better performance on math
benchmarks while also learning to refine its responses more effectively based
on user feedback. Full code is available at
https://github.com/DualityRL/multi-attempt
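To make the multi-attempt setting described above concrete, the sketch below shows one possible rollout loop: the model answers a question, receives textual feedback after an incorrect response, and earns a reward only if it produces a correct answer within the attempt budget. This is a minimal illustration under assumed names (`multi_attempt_rollout`, `is_correct`, `model.generate`) and an assumed binary reward; it is not the authors' implementation, which is available in the linked repository.

```python
# Minimal sketch of a multi-attempt rollout, assuming a chat-style model
# object with a .generate(messages) method and a binary terminal reward.
# All names and the exact feedback/reward scheme are illustrative, not
# taken from the paper's code.

FEEDBACK = (
    "Your previous answer is incorrect. "
    "Please reflect on the mistake and try again."
)


def is_correct(response: str, reference_answer: str) -> bool:
    """Toy correctness check: normalized exact match.
    A real setup would likely use a math-answer verifier instead."""
    return response.strip() == reference_answer.strip()


def multi_attempt_rollout(model, question: str, reference_answer: str,
                          max_attempts: int = 2):
    """Roll out up to `max_attempts` answers for one question,
    appending feedback after each incorrect response.
    Returns the full dialogue and a scalar reward for RL training."""
    dialogue = [{"role": "user", "content": question}]
    for attempt in range(max_attempts):
        response = model.generate(dialogue)  # hypothetical generation call
        dialogue.append({"role": "assistant", "content": response})
        if is_correct(response, reference_answer):
            return dialogue, 1.0  # correct within the attempt budget
        if attempt < max_attempts - 1:
            # Give feedback and let the model try again.
            dialogue.append({"role": "user", "content": FEEDBACK})
    return dialogue, 0.0  # no correct answer produced
```

During training, the number of attempts per question can be varied, and at evaluation time the same prompt format lets one compare 1-attempt versus 2-attempt accuracy as reported in the abstract.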