GEM: A Gym for Agentic LLMs

Abstract

The training paradigm for large language models (LLMs) is moving from static datasets to experience-based learning, in which agents acquire skills by interacting with complex environments. To facilitate this transition, we introduce GEM (General Experience Maker), an open-source environment simulator designed for the age of LLMs. Analogous to OpenAI-Gym for traditional reinforcement learning (RL), GEM provides a standardized framework for the environment-agent interface, including asynchronous vectorized execution for high throughput and flexible wrappers for easy extensibility. GEM also features a diverse suite of environments, robust integrated tools, and single-file example scripts that demonstrate how to use GEM with five popular RL training frameworks. Alongside the framework, we provide a set of baselines across 24 environments using REINFORCE with Return Batch Normalization (ReBN), which, unlike GRPO, is compatible with the full RL setting of dense per-turn rewards and offers better credit assignment. We further conduct apples-to-apples benchmarking of PPO, GRPO, and REINFORCE in both single-turn and multi-turn settings using GEM to shed light on the algorithmic design choices. Lastly, GEM functions as a convenient evaluation toolkit in addition to a training environment. We hope this framework can help accelerate future agentic LLM research.
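To make the idea of REINFORCE with Return Batch Normalization more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that ReBN means computing a discounted return for every turn of every episode and then standardizing those returns with the mean and standard deviation taken over all turns in the batch. The function names `discounted_returns` and `rebn_advantages`, the `gamma` and `eps` parameters, and the toy rewards are all hypothetical and chosen only for illustration.

```python
from statistics import fmean, pstdev


def discounted_returns(rewards, gamma=1.0):
    """Per-turn return G_t = r_t + gamma * G_{t+1} for a single episode."""
    returns = []
    running = 0.0
    # Walk backwards so each turn accumulates the rewards that follow it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def rebn_advantages(batch_rewards, gamma=1.0, eps=1e-8):
    """Standardize per-turn returns across every turn in the batch.

    Assumption: this is one plausible reading of ReBN. Episodes may have
    different lengths, and no grouping by prompt is required.
    """
    per_episode = [discounted_returns(ep, gamma) for ep in batch_rewards]
    flat = [g for ep in per_episode for g in ep]
    mean = fmean(flat)
    std = pstdev(flat)
    return [[(g - mean) / (std + eps) for g in ep] for ep in per_episode]


# Toy batch (hypothetical): two episodes with per-turn rewards.
batch = [
    [0.0, 0.0, 1.0],  # three turns, success on the final turn
    [0.0, -1.0],      # two turns, penalty on the final turn
]
advantages = rebn_advantages(batch, gamma=0.9)
for episode in advantages:
    print([round(a, 3) for a in episode])
```

In a REINFORCE update, each turn's log-probability would be weighted by its own normalized return. Because every turn receives its own value, per-turn rewards and discounting are handled naturally, whereas an approach that assigns one normalized score to a whole response has no obvious way to distinguish between turns.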
