Automatic Generation of High-Performance RL Environments
March 12, 2026
Authors: Seth Karten, Rahul Dev Appapogu, Chi Jin
cs.AI
Abstract
Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a reusable recipe (a generic prompt template, hierarchical verification, and iterative agent-assisted repair) that produces semantically equivalent high-performance environments for under $10 in compute cost. We demonstrate three distinct workflows across five environments. (1) Direct translation, where no prior high-performance implementation exists: EmuRust (1.5x PPO speedup via Rust parallelism for a Game Boy emulator) and PokeJAX, the first GPU-parallel Pokemon battle simulator (500M steps per second (SPS) with random actions, 15.2M SPS under PPO; 22,320x over the TypeScript reference). (2) Translation verified against existing high-performance implementations: HalfCheetah JAX reaches throughput parity with MJX (1.04x) and 5x over Brax at matched GPU batch sizes, and Puffer Pong achieves a 42x PPO training speedup. (3) New environment creation: TCGJax, the first deployable JAX Pokemon TCG engine (717K SPS with random actions, 153K SPS under PPO; 6.6x over the Python reference), synthesized from a web-extracted specification. With a 200M-parameter model, environment overhead drops below 4% of training time. Hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence for all five environments, and cross-backend policy transfer confirms zero sim-to-sim gap for all five. Because TCGJax is synthesized from a private reference that is absent from public repositories, it serves as a contamination control against concerns about agent pretraining data. The paper contains sufficient detail, including representative prompts, verification methodology, and complete results, that a coding agent could reproduce the translations directly from the manuscript.
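To make the rollout level of the hierarchical verification concrete, the following is a minimal sketch of a rollout-equivalence check. It is an illustration under stated assumptions, not the authors' harness: the toy one-dimensional environment, the function names (`reference_step`, `fast_step_batch`, `rollout_equivalence`), and the exact comparison criterion are all hypothetical. The idea is to drive a slow reference implementation and a batched "translated" implementation with the same seeded actions and report every step at which state, reward, or termination flag diverge.

```python
import random
from typing import Callable, List, Tuple

State = Tuple[int, int]            # (position, elapsed steps); toy state, assumed
StepOut = Tuple[State, float, bool]  # (next state, reward, done)


def reference_step(state: State, action: int) -> StepOut:
    """Readable reference: steps one environment at a time."""
    pos, t = state
    delta = 1 if action == 1 else -1
    pos = max(-5, min(5, pos + delta))   # walls at -5 and +5
    t += 1
    reward = 1.0 if pos == 5 else 0.0    # reward only at the right wall
    done = pos == 5 or t >= 20           # goal reached or time limit
    return (pos, t), reward, done


def fast_step_batch(states: List[State], actions: List[int]) -> List[StepOut]:
    """Stand-in for a translated backend: steps a whole batch at once."""
    out = []
    for (pos, t), a in zip(states, actions):
        pos = min(5, max(-5, pos + 2 * a - 1))  # same dynamics, written differently
        t += 1
        out.append(((pos, t), float(pos == 5), pos == 5 or t >= 20))
    return out


def rollout_equivalence(
    num_envs: int,
    horizon: int,
    seed: int,
    fast_backend: Callable[[List[State], List[int]], List[StepOut]] = fast_step_batch,
) -> List[Tuple[int, int, StepOut, StepOut]]:
    """Run both backends on identical seeded actions.

    Returns a list of (env_index, step, reference_output, fast_output)
    for every disagreement; an empty list means the rollouts matched.
    Episodes are not reset after `done`, to keep the sketch short.
    """
    rng = random.Random(seed)
    ref_states = [(rng.randint(-5, 4), 0) for _ in range(num_envs)]
    fast_states = list(ref_states)
    mismatches = []
    for step in range(horizon):
        actions = [rng.randint(0, 1) for _ in range(num_envs)]
        ref_out = [reference_step(s, a) for s, a in zip(ref_states, actions)]
        fast_out = fast_backend(fast_states, actions)
        for i, (r, f) in enumerate(zip(ref_out, fast_out)):
            if r != f:
                mismatches.append((i, step, r, f))
        # Each backend continues from its own state, so divergence compounds.
        ref_states = [o[0] for o in ref_out]
        fast_states = [o[0] for o in fast_out]
    return mismatches
```

In this sketch an empty result from `rollout_equivalence(8, 30, seed=0)` indicates that the two backends agreed on every transition. For environments with floating-point state, an exact `!=` comparison would likely need to be replaced by a tolerance-based one.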