

Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers

May 26, 2025
作者: Rihui Xin, Han Liu, Zecheng Wang, Yupeng Zhang, Dianbo Sui, Xiaolin Hu, Bingning Wang
cs.AI

Abstract

Large Language Models have achieved remarkable success in natural language processing tasks, with Reinforcement Learning playing a key role in adapting them to specific applications. However, obtaining ground truth answers for training LLMs in mathematical problem-solving is often challenging, costly, and sometimes infeasible. This research explores the use of format and length as surrogate signals for training LLMs on mathematical problem-solving, bypassing the need for traditional ground truth answers. Our study shows that a reward function centered on format correctness alone can yield performance gains comparable to the standard GRPO algorithm in the early phases of training. Recognizing the limitations of format-only rewards in later phases, we incorporate length-based rewards. The resulting GRPO approach, leveraging format-length surrogate signals, not only matches but in certain scenarios surpasses the performance of the standard GRPO algorithm that relies on ground truth answers, achieving 40.0% accuracy on AIME2024 with a 7B base model. Through systematic exploration and experimentation, this research offers a practical solution for training LLMs to solve mathematical problems while reducing the dependence on extensive ground truth data collection, and it reveals why our label-free approach succeeds: the base model is like an excellent student who has already mastered mathematical and logical reasoning skills but performs poorly on exams; it simply needs to develop good answering habits to achieve outstanding results, in other words, to unlock the capabilities it already possesses.
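The abstract describes a reward built from two answer-free signals: format correctness (dominant early in training) and response length (added later). A minimal sketch of such a surrogate reward is shown below; the template markers (`<think>` tags, `\boxed{}`), the length cap, and the 0.5 weighting are illustrative assumptions, not the paper's actual reward shaping.

```python
import re

def surrogate_reward(response: str, target_len: int = 2048) -> float:
    """Hypothetical format-length surrogate reward: score a model
    response without any ground-truth answer, using only whether it
    follows the expected template and how long its reasoning is."""
    # Format signal: response contains a reasoning span and a boxed answer
    # (assumed template; the paper's exact format check may differ).
    has_reasoning = bool(re.search(r"<think>.*</think>", response, re.DOTALL))
    has_answer = r"\boxed{" in response
    format_reward = 1.0 if (has_reasoning and has_answer) else 0.0

    # Length signal: reward longer reasoning up to a cap, intended to
    # keep training progressing once format-only rewards plateau.
    length_reward = min(len(response) / target_len, 1.0)

    return format_reward + 0.5 * length_reward
```

In a GRPO-style setup, a reward like this would simply replace the answer-matching reward when scoring each group of sampled responses; no labels are consulted at any point.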
