Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers

Abstract

Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, with Reinforcement Learning playing a key role in adapting them to specific applications. However, obtaining ground truth answers for training LLMs in mathematical problem-solving is often challenging, costly, and sometimes infeasible. This research investigates the use of format and length as surrogate signals to train LLMs for mathematical problem-solving, bypassing the need for traditional ground truth answers. Our study shows that a reward function centered on format correctness alone can yield performance improvements comparable to the standard GRPO algorithm in the early phases of training. Recognizing the limitations of format-only rewards in the later phases, we incorporate length-based rewards. The resulting GRPO approach, which uses format-length surrogate signals, matches and in certain scenarios surpasses the performance of the standard GRPO algorithm that relies on ground truth answers, achieving 40.0% accuracy on AIME2024 with a 7B base model. Through systematic exploration and experimentation, this research offers a practical solution for training LLMs to solve mathematical problems while reducing the dependence on extensive ground truth data collection. It also reveals why our label-free approach succeeds: the base model is like an excellent student who has already mastered mathematical and logical reasoning skills but performs poorly on the test paper. It simply needs to develop good answering habits to achieve outstanding results in exams, in other words, to unlock the capabilities it already possesses.
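The abstract does not give the exact form of the reward, so the following is only a minimal sketch of what a format-length surrogate reward could look like, not the authors' implementation. It assumes that "format correctness" means exactly one `\boxed{...}` final answer, that length is measured in whitespace-separated words, and that the length term is a piecewise-linear bump; the names `format_reward`, `length_reward`, `surrogate_reward`, and the parameters `target_len`, `max_len`, and `length_weight` are all hypothetical. The only part taken from standard GRPO is the group-relative advantage, which normalizes each reward by the mean and standard deviation of its sampling group.

```python
import re
from statistics import mean, pstdev

# Assumption: a well-formatted response contains exactly one \boxed{...} answer.
BOXED = re.compile(r"\\boxed\{([^{}]+)\}")


def format_reward(response: str) -> float:
    """Return 1.0 if the response has exactly one non-empty boxed answer, else 0.0."""
    return 1.0 if len(BOXED.findall(response)) == 1 else 0.0


def length_reward(response: str, target_len: int = 200, max_len: int = 400) -> float:
    """Hypothetical length signal: rises linearly to 1.0 at target_len words,
    then falls linearly to 0.0 at max_len words."""
    n = len(response.split())
    if n <= target_len:
        return n / target_len
    if n >= max_len:
        return 0.0
    return (max_len - n) / (max_len - target_len)


def surrogate_reward(response: str, length_weight: float = 0.5) -> float:
    """Combine both signals without ever looking at a ground truth answer.
    Assumption: the length term only counts when the format is correct."""
    fmt = format_reward(response)
    return fmt + length_weight * fmt * length_reward(response)


def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style group-relative advantages: (r - mean) / std within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All samples scored the same, so the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


if __name__ == "__main__":
    group = [
        "We compute 6 * 7 step by step, so the answer is \\boxed{42}.",
        "The answer is 42.",
    ]
    rewards = [surrogate_reward(r) for r in group]
    print(rewards, group_advantages(rewards))
```

In this sketch, a response with no boxed answer scores zero regardless of its length, which mirrors the abstract's ordering: format is the primary signal and length is added as a refinement for later training.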
