Process-based Self-Rewarding Language Models

Abstract

Large Language Models (LLMs) have demonstrated outstanding performance across various downstream tasks and are widely applied in many scenarios. Human-annotated preference data is used in training to further improve LLM performance, but this approach is constrained by the upper limit of human performance. The Self-Rewarding method was therefore proposed, in which LLMs generate training data by rewarding their own outputs. However, the existing self-rewarding paradigm is not effective in mathematical reasoning scenarios and may even degrade performance. In this work, we propose the Process-based Self-Rewarding pipeline for language models, which introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization within the self-rewarding paradigm. Through iterative Process-based Self-Rewarding, our new paradigm improves LLM performance on multiple mathematical reasoning benchmarks, demonstrating the immense potential of self-rewarding to achieve LLM reasoning that may surpass human capabilities.
