Automatic Generation of High-Performance RL Environments

Abstract

Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a reusable recipe (a generic prompt template, hierarchical verification, and iterative agent-assisted repair) that produces semantically equivalent high-performance environments for under $10 in compute cost. We demonstrate three distinct workflows across five environments. Direct translation (no prior high-performance implementation exists): EmuRust (1.5x PPO speedup via Rust parallelism for a Game Boy emulator) and PokeJAX, the first GPU-parallel Pokemon battle simulator (500M SPS with random actions, 15.2M SPS with PPO; 22,320x over the TypeScript reference). Translation verified against existing high-performance implementations: throughput parity with MJX (1.04x) and 5x over Brax at matched GPU batch sizes (HalfCheetah JAX); 42x for PPO (Puffer Pong). New environment creation: TCGJax, the first deployable JAX Pokemon TCG engine (717K SPS with random actions, 153K SPS with PPO; 6.6x over the Python reference), synthesized from a web-extracted specification. At 200M parameters, environment overhead drops below 4% of training time. Hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence for all five environments, and cross-backend policy transfer confirms zero sim-to-sim gap for all five. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for concerns about agent pretraining data. The paper contains sufficient detail, including representative prompts, verification methodology, and complete results, that a coding agent could reproduce the translations directly from the manuscript.
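To make the rollout tier of the hierarchical verification concrete, the following is a minimal sketch of one way such a check could look. It is not the paper's harness: the toy one-dimensional environment, the names `reference_step`, `translated_step`, and `rollout_mismatches`, and the reset rule are all assumptions introduced for illustration. The idea shown is only that a reference implementation and its translation are driven by the same seeded action stream and compared step by step.

```python
import random
from typing import Callable, List, Tuple

# A step function maps (state, action) -> (next_state, reward, done).
Step = Callable[[int, int], Tuple[int, float, bool]]


def reference_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Hypothetical reference rules: walk on a line, episode ends at |state| >= 5."""
    next_state = state + (1 if action == 1 else -1)
    done = abs(next_state) >= 5
    reward = 1.0 if next_state >= 5 else 0.0
    return next_state, reward, done


def translated_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Hypothetical port of the same rules, deliberately written differently."""
    next_state = state + 2 * action - 1
    return next_state, float(next_state >= 5), not (-5 < next_state < 5)


def rollout_mismatches(ref: Step, new: Step, seed: int, max_steps: int = 200) -> List[int]:
    """Feed both implementations one seeded action stream.

    Returns the step indices at which (next_state, reward, done) differ.
    An empty list means the two agreed on every step of this rollout.
    """
    rng = random.Random(seed)
    ref_state = new_state = 0
    mismatches: List[int] = []
    for t in range(max_steps):
        action = rng.randint(0, 1)
        ref_out = ref(ref_state, action)
        new_out = new(new_state, action)
        if ref_out != new_out:
            mismatches.append(t)
        ref_state, new_state = ref_out[0], new_out[0]
        if ref_out[2] or new_out[2]:
            # Assumed reset rule: both sides restart from state 0.
            ref_state = new_state = 0
    return mismatches


if __name__ == "__main__":
    for seed in range(5):
        print(seed, rollout_mismatches(reference_step, translated_step, seed))
```

A rollout check of this kind only covers the trajectories it samples, which is presumably why it is paired with property and interaction tests in the hierarchy described above.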
