Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers

Abstract

Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, with Reinforcement Learning playing a key role in adapting them to specific applications. However, obtaining ground truth answers for training LLMs in mathematical problem-solving is often challenging, costly, and sometimes infeasible. This research explores using format and length as surrogate signals to train LLMs for mathematical problem-solving, bypassing the need for traditional ground truth answers. Our study shows that a reward function centered on format correctness alone can yield performance improvements comparable to the standard GRPO algorithm in the early phases. Recognizing the limitations of format-only rewards in the later phases, we incorporate length-based rewards. The resulting GRPO approach, leveraging format-length surrogate signals, not only matches but in certain scenarios surpasses the performance of the standard GRPO algorithm that relies on ground truth answers, achieving 40.0% accuracy on AIME2024 with a 7B base model. Through systematic exploration and experimentation, this research offers a practical solution for training LLMs to solve mathematical problems while reducing the dependence on extensive ground truth data collection. It also reveals why our label-free approach succeeds: the base model is like an excellent student who has already mastered mathematical and logical reasoning skills but performs poorly on the test paper. It simply needs to develop good answering habits to achieve outstanding results in exams; in other words, it needs to unlock the capabilities it already possesses.
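To make the idea of a label-free reward concrete, the following is a minimal Python sketch of how a format signal and a length signal might be combined and then turned into group-relative advantages in the style of GRPO. The abstract does not specify the reward definitions, so every detail here is an assumption made for illustration: the format rule (exactly one `\boxed{` marker), the triangular length score, the `target`, `tolerance`, and `length_weight` parameters, and all function names are hypothetical and are not the authors' implementation.

```python
import statistics


def format_reward(response: str) -> float:
    """Return 1.0 if the response has exactly one \\boxed{ marker (assumed format rule)."""
    return 1.0 if response.count("\\boxed{") == 1 else 0.0


def length_reward(response: str, target: int = 200, tolerance: int = 150) -> float:
    """Hypothetical triangular score: 1.0 at `target` words, decaying linearly to 0.0."""
    n_words = len(response.split())
    return max(0.0, 1.0 - abs(n_words - target) / tolerance)


def surrogate_reward(response: str, length_weight: float = 0.5) -> float:
    """Combine both signals without ever looking at a ground truth answer.

    Assumption: the length term only counts when the format check passes,
    so a long but malformed response earns nothing.
    """
    fmt = format_reward(response)
    return fmt * (1.0 + length_weight * length_reward(response))


def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    if std == 0.0:
        # All responses scored the same, so the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    # A group of sampled responses to the same problem (toy data).
    group = [
        "Let x = 2. Then x + 2 = 4, so the answer is \\boxed{4}.",
        "I think it is four.",
        "\\boxed{4} or maybe \\boxed{5}",
    ]
    rewards = [surrogate_reward(r) for r in group]
    print(rewards)
    print(group_advantages(rewards))
```

In this sketch only the first toy response passes the assumed format check, so it alone receives a positive advantage within its group. How the format and length terms are actually defined and weighted would need to be taken from the full paper.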
