GEM: A Gym for Agentic LLMs
October 1, 2025
Authors: Zichen Liu, Anya Sims, Keyu Duan, Changyu Chen, Simon Yu, Xiangxin Zhou, Haotian Xu, Shaopan Xiong, Bo Liu, Chenmien Tan, Chuen Yang Beh, Weixun Wang, Hao Zhu, Weiyan Shi, Diyi Yang, Michael Shieh, Yee Whye Teh, Wee Sun Lee, Min Lin
cs.AI
Abstract
The training paradigm for large language models (LLMs) is moving from static
datasets to experience-based learning, where agents acquire skills by
interacting with complex environments. To facilitate this transition, we
introduce GEM (General Experience Maker), an open-source environment simulator
designed for the age of LLMs. Analogous to OpenAI-Gym for traditional
reinforcement learning (RL), GEM provides a standardized framework for the
environment-agent interface, including asynchronous vectorized execution for
high throughput and flexible wrappers for easy extensibility. GEM also
features a diverse suite of environments, robust integrated tools, and
single-file example scripts demonstrating how to use GEM with five popular RL
training frameworks. Alongside these, we provide a set of baselines across
24 environments using REINFORCE with Return Batch Normalization (ReBN), which,
unlike GRPO, is compatible with the full RL setting of dense per-turn
rewards and offers better credit assignment. We further conduct apples-to-apples
benchmarking of PPO, GRPO, and REINFORCE in both single- and multi-turn settings
using GEM to shed light on these algorithmic designs. Lastly, GEM functions
as a convenient evaluation toolkit in addition to a training environment. We hope
this framework can help accelerate future agentic LLM research.
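The standardized environment-agent interface the abstract describes can be illustrated with a minimal sketch in the style of the OpenAI-Gym reset/step loop. This is a toy mock, not GEM's actual API: the `EchoEnv` class, its observation strings, and the `rollout` helper are all invented here for demonstration.

```python
# Illustrative sketch of a Gym-style text-environment interface.
# EchoEnv and rollout are hypothetical, not part of GEM's real API.

class EchoEnv:
    """Toy text environment: the agent is rewarded for echoing the prompt."""

    def __init__(self, prompt: str = "hello"):
        self.prompt = prompt
        self.done = True

    def reset(self) -> str:
        """Start a new episode and return the initial observation (a string)."""
        self.done = False
        return f"Repeat after me: {self.prompt}"

    def step(self, action: str):
        """Apply the agent's text action; return (obs, reward, done, info)."""
        reward = 1.0 if action.strip() == self.prompt else 0.0
        self.done = True
        return "episode over", reward, self.done, {}


def rollout(env, policy) -> float:
    """Collect one episode via the standard observe-act-reward loop."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        action = policy(obs)
        obs, reward, done, _ = env.step(action)
        total += reward
    return total
```

For example, `rollout(EchoEnv(), lambda obs: "hello")` returns a return of 1.0, while a policy that answers anything else scores 0.0. The point of such a uniform interface is that training loops, vectorized execution, and wrappers can be written once against `reset`/`step` rather than per environment.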
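The ReBN baseline can be sketched as follows: compute per-turn discounted returns for each episode, then standardize all returns pooled across the batch by their mean and standard deviation. This is a sketch of the idea only, under the assumption that "Return Batch Normalization" means batch-level standardization of per-turn returns; the paper's exact formulation may differ. Because each turn gets its own normalized return, this handles dense per-turn rewards, whereas GRPO normalizes a single terminal reward within a group of rollouts.

```python
import math

def discounted_returns(rewards, gamma=1.0):
    """Per-turn returns G_t = sum_{k>=t} gamma^(k-t) * r_k for one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def rebn_advantages(batch_rewards, gamma=1.0, eps=1e-8):
    """Sketch of return batch normalization: pool every per-turn return in
    the batch, then standardize by the pooled mean and std. Each episode's
    advantages keep their own per-turn credit assignment."""
    per_episode = [discounted_returns(ep, gamma) for ep in batch_rewards]
    pooled = [g for ep in per_episode for g in ep]
    mean = sum(pooled) / len(pooled)
    std = math.sqrt(sum((g - mean) ** 2 for g in pooled) / len(pooled))
    return [[(g - mean) / (std + eps) for g in ep] for ep in per_episode]
```

In a REINFORCE update, these advantages would weight the per-turn log-probabilities, i.e. the loss is the negative sum of `A_t * log pi(a_t | s_t)` over the batch; the normalization serves as a simple variance-reduction baseline without a learned critic.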