Process-based Self-Rewarding Language Models
March 5, 2025
Authors: Shimao Zhang, Xiao Liu, Xin Zhang, Junxiao Liu, Zheheng Luo, Shujian Huang, Yeyun Gong
cs.AI
Abstract
Large Language Models have demonstrated outstanding performance across
various downstream tasks and have been widely applied in multiple scenarios.
To further improve LLMs' performance, human-annotated preference data is used
for training, but this approach is constrained by the upper limit of human
performance. Therefore, the Self-Rewarding method has been proposed, where LLMs generate
training data by rewarding their own outputs. However, the existing
self-rewarding paradigm is not effective in mathematical reasoning scenarios
and may even lead to a decline in performance. In this work, we propose the
Process-based Self-Rewarding pipeline for language models, which introduces
long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference
optimization within the self-rewarding paradigm. Our new paradigm successfully
enhances the performance of LLMs on multiple mathematical reasoning benchmarks
through iterative Process-based Self-Rewarding, demonstrating the immense
potential of self-rewarding to achieve LLM reasoning that may surpass human
capabilities.
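As a rough illustration of the pipeline described in the abstract, the sketch below shows what one iteration of Process-based Self-Rewarding could look like: the model samples candidate next reasoning steps, judges them itself step by step, collects step-level preference pairs, and is then updated with step-wise preference optimization. The helper names (`generate_steps`, `judge_step`, `preference_update`), the termination cue, and the candidate count are illustrative assumptions, not the authors' implementation.

```python
# A rough sketch of one Process-based Self-Rewarding round. The helpers
# `generate_steps`, `judge_step`, and `preference_update`, the termination
# cue, and the candidate count are illustrative assumptions only.
from typing import Callable, List, Tuple

def self_rewarding_round(
    model,                                        # current policy (any LLM handle)
    problems: List[str],                          # math problems for this round
    generate_steps: Callable[..., List[str]],     # samples candidate next reasoning steps
    judge_step: Callable[..., float],             # same model scores one step (step-wise LLM-as-a-Judge)
    preference_update: Callable[..., None],       # step-wise preference optimization (e.g. step-level DPO)
    num_candidates: int = 4,
    max_steps: int = 32,
):
    """Generate long-thought traces step by step, judge candidate steps with the
    model itself, collect step-level preference pairs, then train on them."""
    pairs: List[Tuple[str, str, str]] = []        # (partial trace, chosen step, rejected step)

    for problem in problems:
        trace = problem                           # partial solution built step by step
        for _ in range(max_steps):
            # 1. Sample several candidate next steps from the current model.
            candidates = generate_steps(model, trace, n=num_candidates)
            # 2. The same model scores each candidate (step-wise LLM-as-a-Judge).
            scores = [judge_step(model, trace, step) for step in candidates]
            chosen = candidates[scores.index(max(scores))]
            rejected = candidates[scores.index(min(scores))]
            pairs.append((trace, chosen, rejected))
            # 3. Extend the trace with the preferred step and continue.
            trace = trace + "\n" + chosen
            if "final answer" in chosen.lower():  # assumed termination cue
                break

    # 4. Step-wise preference optimization on the self-generated pairs; the
    #    updated model is then used to start the next self-rewarding round.
    preference_update(model, pairs)
    return model
```

Repeating this round with the updated model is what the abstract refers to as iterative Process-based Self-Rewarding.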