GEM: A Gym for Agentic LLMs
October 1, 2025
Authors: Zichen Liu, Anya Sims, Keyu Duan, Changyu Chen, Simon Yu, Xiangxin Zhou, Haotian Xu, Shaopan Xiong, Bo Liu, Chenmien Tan, Chuen Yang Beh, Weixun Wang, Hao Zhu, Weiyan Shi, Diyi Yang, Michael Shieh, Yee Whye Teh, Wee Sun Lee, Min Lin
cs.AI
Abstract
The training paradigm for large language models (LLMs) is moving from static
datasets to experience-based learning, where agents acquire skills by
interacting with complex environments. To facilitate this transition, we
introduce GEM (General Experience Maker), an open-source environment simulator
designed for the age of LLMs. Analogous to OpenAI-Gym for traditional
reinforcement learning (RL), GEM provides a standardized framework for the
environment-agent interface, including asynchronous vectorized execution for
high throughput and flexible wrappers for easy extensibility. GEM also
features a diverse suite of environments, robust integrated tools, and
single-file example scripts demonstrating how to use GEM with five popular RL
training frameworks. Alongside these, we provide a set of baselines across
24 environments using REINFORCE with Return Batch Normalization (ReBN), which,
unlike GRPO, is compatible with the full RL setting of dense per-turn
rewards and offers better credit assignment. We further conduct apples-to-apples
benchmarking of PPO, GRPO, and REINFORCE in both single- and multi-turn settings
using GEM to shed light on these algorithmic designs. Lastly, GEM functions
as a convenient evaluation toolkit in addition to a training environment. We hope
this framework can help accelerate future agentic LLM research.
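The standardized environment-agent interface the abstract describes can be illustrated with a minimal sketch in the style of the OpenAI-Gym reset/step loop. This is a toy mock, not GEM's actual API: the `EchoEnv` class, its observation strings, and the `rollout` helper are all invented here for demonstration.

```python
# Illustrative sketch of a Gym-style text-environment interface.
# EchoEnv and rollout are hypothetical, not part of GEM's real API.

class EchoEnv:
    """Toy text environment: the agent is rewarded for echoing the prompt."""

    def __init__(self, prompt: str = "hello"):
        self.prompt = prompt
        self.done = True

    def reset(self) -> str:
        """Start a new episode and return the initial observation (a string)."""
        self.done = False
        return f"Repeat after me: {self.prompt}"

    def step(self, action: str):
        """Apply the agent's text action; return (obs, reward, done, info)."""
        reward = 1.0 if action.strip() == self.prompt else 0.0
        self.done = True
        return "episode over", reward, self.done, {}


def rollout(env, policy) -> float:
    """Collect one episode via the standard observe-act-reward loop."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        action = policy(obs)
        obs, reward, done, _ = env.step(action)
        total += reward
    return total
```

For example, `rollout(EchoEnv(), lambda obs: "hello")` returns a return of 1.0, while a policy that answers anything else scores 0.0. The point of such a uniform interface is that training loops, vectorized execution, and wrappers can be written once against `reset`/`step` rather than per environment.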
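The ReBN baseline can be sketched as follows: compute per-turn discounted returns for each episode, then standardize all returns pooled across the batch by their mean and standard deviation. This is a sketch of the idea only, under the assumption that "Return Batch Normalization" means batch-level standardization of per-turn returns; the paper's exact formulation may differ. Because each turn gets its own normalized return, this handles dense per-turn rewards, whereas GRPO normalizes a single terminal reward within a group of rollouts.

```python
import math

def discounted_returns(rewards, gamma=1.0):
    """Per-turn returns G_t = sum_{k>=t} gamma^(k-t) * r_k for one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def rebn_advantages(batch_rewards, gamma=1.0, eps=1e-8):
    """Sketch of return batch normalization: pool every per-turn return in
    the batch, then standardize by the pooled mean and std. Each episode's
    advantages keep their own per-turn credit assignment."""
    per_episode = [discounted_returns(ep, gamma) for ep in batch_rewards]
    pooled = [g for ep in per_episode for g in ep]
    mean = sum(pooled) / len(pooled)
    std = math.sqrt(sum((g - mean) ** 2 for g in pooled) / len(pooled))
    return [[(g - mean) / (std + eps) for g in ep] for ep in per_episode]
```

In a REINFORCE update, these advantages would weight the per-turn log-probabilities, i.e. the loss is the negative sum of `A_t * log pi(a_t | s_t)` over the batch; the normalization serves as a simple variance-reduction baseline without a learned critic.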