

GEM: A Gym for Agentic LLMs

October 1, 2025
作者: Zichen Liu, Anya Sims, Keyu Duan, Changyu Chen, Simon Yu, Xiangxin Zhou, Haotian Xu, Shaopan Xiong, Bo Liu, Chenmien Tan, Chuen Yang Beh, Weixun Wang, Hao Zhu, Weiyan Shi, Diyi Yang, Michael Shieh, Yee Whye Teh, Wee Sun Lee, Min Lin
cs.AI

Abstract

The training paradigm for large language models (LLMs) is moving from static datasets to experience-based learning, where agents acquire skills by interacting with complex environments. To facilitate this transition, we introduce GEM (General Experience Maker), an open-source environment simulator designed for the age of LLMs. Analogous to OpenAI-Gym for traditional reinforcement learning (RL), GEM provides a standardized framework for the environment-agent interface, including asynchronous vectorized execution for high throughput and flexible wrappers for easy extensibility. GEM also features a diverse suite of environments, robust integrated tools, and single-file example scripts demonstrating how to use GEM with five popular RL training frameworks. Alongside these, we provide a set of baselines across 24 environments using REINFORCE with Return Batch Normalization (ReBN), which -- unlike GRPO -- is compatible with the full RL setting of dense per-turn rewards and offers better credit assignment. We further conduct apples-to-apples benchmarking of PPO, GRPO, and REINFORCE in both single- and multi-turn settings using GEM to shed light on their algorithmic designs. Lastly, beyond serving as a training environment, GEM also functions as a convenient evaluation toolkit. We hope this framework helps accelerate future agentic LLM research.
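The abstract describes a standardized environment-agent interface analogous to OpenAI-Gym. The sketch below illustrates what such a reset/step interaction loop looks like for a multi-turn text environment; `GuessNumberEnv` and the binary-search policy are toy stand-ins invented for illustration, not GEM's actual API or environments.

```python
# Hypothetical sketch of a Gym-style environment-agent loop for a
# multi-turn text task. All names here are illustrative assumptions,
# not GEM's real interface.
import random

class GuessNumberEnv:
    """Toy text environment exposing the classic reset/step interface."""
    def __init__(self, low=1, high=8, max_turns=5):
        self.low, self.high, self.max_turns = low, high, max_turns

    def reset(self, seed=None):
        rng = random.Random(seed)
        self.target = rng.randint(self.low, self.high)
        self.turn = 0
        return f"Guess a number between {self.low} and {self.high}.", {}

    def step(self, action: str):
        # Returns (observation, reward, terminated, truncated, info),
        # with a dense per-turn reward signal.
        self.turn += 1
        guess = int(action)
        if guess == self.target:
            return "Correct!", 1.0, True, False, {}
        truncated = self.turn >= self.max_turns
        hint = "higher" if guess < self.target else "lower"
        return f"Try {hint}.", 0.0, False, truncated, {}

def run_episode(env, policy, seed=0):
    """Standard interaction loop: observe, act, receive reward, repeat."""
    obs, _ = env.reset(seed=seed)
    total, terminated, truncated = 0.0, False, False
    while not (terminated or truncated):
        action = policy(obs)
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
    return total

def make_policy(low=1, high=8):
    """Binary-search policy standing in for an LLM agent."""
    state = {"low": low, "high": high}
    def policy(obs):
        if "higher" in obs:
            state["low"] = state["last"] + 1
        elif "lower" in obs:
            state["high"] = state["last"] - 1
        state["last"] = (state["low"] + state["high"]) // 2
        return str(state["last"])
    return policy

print(run_episode(GuessNumberEnv(), make_policy()))  # prints 1.0
```

Keeping this interface uniform across environments is what lets the same training loop drive math, game, and tool-use tasks alike, and is what wrappers and vectorized executors build on.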
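The abstract contrasts ReBN with GRPO: unlike GRPO's group-relative scoring of whole responses, return batch normalization works with dense per-turn rewards. The paper's exact formulation is not given in the abstract; the following is a plausible reading in which per-turn discounted returns are normalized across the batch to form advantages.

```python
# Hedged sketch of Return Batch Normalization (ReBN): per-turn discounted
# returns, normalized across the whole batch before use as advantages in
# a REINFORCE loss. The normalization details are an assumption, made
# only to illustrate the idea of per-turn credit assignment.
import math

def discounted_returns(rewards, gamma=1.0):
    """Per-turn returns G_t = sum_{k>=t} gamma^(k-t) * r_k for one episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

def rebn_advantages(batch_rewards, gamma=1.0, eps=1e-8):
    """Flatten per-turn returns across the batch, then normalize them."""
    returns = [g for ep in batch_rewards for g in discounted_returns(ep, gamma)]
    mean = sum(returns) / len(returns)
    std = math.sqrt(sum((g - mean) ** 2 for g in returns) / len(returns))
    return [(g - mean) / (std + eps) for g in returns]

# Two episodes with dense per-turn rewards: each turn gets its own
# normalized advantage, rather than one score per whole response.
advs = rebn_advantages([[0.0, 0.0, 1.0], [1.0, 0.0]])
```

Because every turn keeps its own return, a late mistake after an early reward is penalized at the turn where it happened, which is the credit-assignment advantage the abstract claims over response-level scoring.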