Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers
May 26, 2025
Authors: Rihui Xin, Han Liu, Zecheng Wang, Yupeng Zhang, Dianbo Sui, Xiaolin Hu, Bingning Wang
cs.AI
Abstract
Large Language Models have achieved remarkable success in natural language
processing tasks, with Reinforcement Learning playing a key role in adapting
them to specific applications. However, obtaining ground truth answers for training LLMs in mathematical problem-solving is often challenging, costly, and sometimes infeasible. This research explores the use of format and length as surrogate signals to train LLMs for mathematical problem-solving, bypassing the need for traditional ground truth answers. Our study shows that a reward function centered on format correctness alone can yield performance improvements comparable to those of the standard GRPO algorithm in the early phases of training.
Recognizing the limitations of format-only rewards in the later phases, we
incorporate length-based rewards. The resulting GRPO approach, leveraging
format-length surrogate signals, not only matches but surpasses the performance
of the standard GRPO algorithm relying on ground truth answers in certain
scenarios, achieving 40.0% accuracy on AIME2024 with a 7B base model. Through
systematic exploration and experimentation, this research not only offers a practical solution for training LLMs to solve mathematical problems while reducing the dependence on extensive ground truth data collection, but also reveals why our label-free approach succeeds: the base model is like an excellent student who has already mastered mathematical and logical reasoning skills but performs poorly on the test paper; it simply needs to develop good answering habits to achieve outstanding results in exams, in other words, to unlock the capabilities it already possesses.
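To make the surrogate-signal idea concrete, below is a minimal, hypothetical Python sketch of a format-length reward of the kind the abstract describes. The paper does not specify its exact reward shaping here, so the answer-format regex, the group-relative length term, and the weights `w_format` and `w_length` are illustrative assumptions, not the authors' implementation:

```python
import re

# Hypothetical sketch of a format-length surrogate reward for GRPO-style
# training without ground truth answers. The pattern, the length shaping,
# and the weights below are assumptions made for illustration only.

# Assumed target format: a <think>...</think> reasoning block followed by
# a final answer wrapped in \boxed{...}.
FORMAT_PATTERN = re.compile(
    r"^<think>.*?</think>\s*\\boxed\{.*?\}\s*$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """1.0 if the completion matches the expected answer format, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0

def length_reward(completion: str, group_lengths: list[int]) -> float:
    """Group-relative length score in [0, 1]: completions near the group's
    mean length score highest, penalizing truncated or rambling outputs
    (one plausible shaping; the paper's exact scheme may differ)."""
    mean_len = sum(group_lengths) / len(group_lengths)
    deviation = abs(len(completion) - mean_len) / max(mean_len, 1.0)
    return max(0.0, 1.0 - deviation)

def surrogate_reward(completion: str, group: list[str],
                     w_format: float = 1.0, w_length: float = 0.5) -> float:
    """Weighted format + length reward; no ground truth answer is consulted."""
    group_lengths = [len(c) for c in group]
    return (w_format * format_reward(completion)
            + w_length * length_reward(completion, group_lengths))

# Example: score each sampled completion in a GRPO group.
if __name__ == "__main__":
    group = [
        "<think>2 + 2 = 4</think> \\boxed{4}",
        "no reasoning block, just a guess: 5",
    ]
    for c in group:
        print(round(surrogate_reward(c, group), 3))
```

In a GRPO setup, a scalar like this would stand in for the ground-truth-based reward when computing each group's normalized advantages, which is how the abstract's label-free training signal would plug into the algorithm.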