
Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning

September 29, 2025
Authors: Xin Qiu, Yulu Gan, Conor F. Hayes, Qiyao Liang, Elliot Meyerson, Babak Hodjat, Risto Miikkulainen
cs.AI

Abstract

Fine-tuning pre-trained large language models (LLMs) for downstream tasks is a critical step in the AI deployment pipeline. Reinforcement learning (RL) is arguably the most prominent fine-tuning method, contributing to the birth of many state-of-the-art LLMs. In contrast, evolution strategies (ES), which once showed performance comparable to RL on models with a few million parameters, have been neglected due to pessimistic perceptions of their scalability to larger models. In this work, we report the first successful attempt to scale up ES for fine-tuning the full parameters of LLMs, showing the surprising fact that ES can search efficiently over billions of parameters and outperform existing RL fine-tuning methods in multiple respects, including sample efficiency, tolerance to long-horizon rewards, robustness to different base LLMs, a lower tendency toward reward hacking, and more stable performance across runs. It therefore serves as a basis for unlocking a new direction in LLM fine-tuning beyond what current RL techniques provide. The source code is provided at: https://github.com/VsonicV/es-fine-tuning-paper.
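For readers unfamiliar with the method family, the sketch below illustrates the classic ES estimator this line of work builds on: sample Gaussian perturbations of the parameters, score each perturbed model with the task reward, and move the parameters along the reward-weighted average of the perturbation directions. This is a minimal, illustrative sketch only, not the authors' implementation; the toy `reward_fn`, the tiny parameter vector, and all hyperparameters (`pop_size`, `sigma`, `lr`) are placeholder assumptions. In the paper's setting, `theta` would be the full set of LLM weights and the reward a downstream-task score.

```python
import numpy as np

def reward_fn(theta: np.ndarray) -> float:
    # Placeholder reward: higher as parameters approach a fixed target.
    # In the paper's setting this would instead be the task score of the
    # LLM obtained by loading the perturbed weights.
    target = np.linspace(-1.0, 1.0, theta.size)
    return -float(np.mean((theta - target) ** 2))

def es_finetune(theta, iterations=200, pop_size=32, sigma=0.02, lr=0.05):
    for _ in range(iterations):
        # Antithetic (mirrored) Gaussian perturbations reduce variance.
        eps = np.random.randn(pop_size // 2, theta.size)
        eps = np.concatenate([eps, -eps], axis=0)
        # Evaluate the reward of each perturbed parameter vector.
        rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
        # Centered-rank shaping makes the update invariant to reward scale.
        ranks = rewards.argsort().argsort().astype(np.float64)
        weights = ranks / (len(ranks) - 1) - 0.5
        # ES gradient estimate: reward-weighted sum of the noise directions.
        grad = (weights[:, None] * eps).sum(axis=0) / (len(eps) * sigma)
        theta = theta + lr * grad
    return theta

theta = np.zeros(64)  # stand-in for flattened model weights
theta = es_finetune(theta)
print("final reward:", reward_fn(theta))
```

Only perturbed-model rewards are needed, so the update requires no backpropagation through the model, which is what makes the approach attractive for long-horizon, non-differentiable rewards.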