Automatic Generation of High-Performance RL Environments

Abstract

Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a reusable recipe (a generic prompt template, hierarchical verification, and iterative agent-assisted repair) that produces semantically equivalent high-performance environments for under $10 in compute cost. We demonstrate three distinct workflows across five environments. Direct translation (no prior high-performance implementation exists): EmuRust (1.5x PPO speedup via Rust parallelism for a Game Boy emulator) and PokeJAX, the first GPU-parallel Pokémon battle simulator (500M SPS with random actions, 15.2M SPS with PPO; 22,320x over the TypeScript reference). Translation verified against existing high-performance implementations: throughput parity with MJX (1.04x) and 5x over Brax at matched GPU batch sizes (HalfCheetah JAX); 42x PPO speedup (Puffer Pong). New environment creation: TCGJax, the first deployable JAX Pokémon TCG engine (717K SPS with random actions, 153K SPS with PPO; 6.6x over the Python reference), synthesized from a web-extracted specification. At 200M parameters, the environment overhead drops below 4% of training time. Hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence for all five environments; cross-backend policy transfer confirms zero sim-to-sim gap for all five environments. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for concerns about agent pretraining data. The paper contains sufficient detail, including representative prompts, verification methodology, and complete results, that a coding agent could reproduce the translations directly from the manuscript.
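To make the rollout tier of the hierarchical verification concrete, the following is a minimal sketch in plain Python. It is not the paper's harness. The toy environment and the names `reference_step`, `translated_step`, and `rollout_equivalent` are hypothetical, introduced only for illustration. The idea shown is to drive a reference implementation and its translation with the same seeded random action sequence and require identical states, rewards, and termination flags at every step.

```python
import random
from typing import Callable, Tuple

State = Tuple[int, int]  # (position, velocity) of a hypothetical toy environment
StepFn = Callable[[State, int], Tuple[State, float, bool]]


def reference_step(state: State, action: int) -> Tuple[State, float, bool]:
    """Toy stand-in for a slow reference environment."""
    pos, vel = state
    vel = vel + (action - 1)      # actions {0, 1, 2} map to acceleration -1, 0, +1
    vel = max(-3, min(3, vel))    # clamp velocity to [-3, 3]
    pos = pos + vel
    done = abs(pos) >= 10         # episode ends when the agent leaves (-10, 10)
    reward = 0.0 if done else 1.0
    return (pos, vel), reward, done


def translated_step(state: State, action: int) -> Tuple[State, float, bool]:
    """Toy stand-in for a translated environment: same semantics, different code."""
    pos, vel = state
    new_vel = sorted((-3, vel + action - 1, 3))[1]  # median of three == clamp
    new_pos = pos + new_vel
    done = not (-10 < new_pos < 10)
    return (new_pos, new_vel), float(not done), done


def rollout_equivalent(step_a: StepFn, step_b: StepFn, init_state: State,
                       num_episodes: int = 100, horizon: int = 50,
                       seed: int = 0) -> bool:
    """Return True if both step functions agree on every seeded random rollout."""
    rng = random.Random(seed)
    for _ in range(num_episodes):
        state_a = state_b = init_state
        for _ in range(horizon):
            action = rng.randrange(3)
            state_a, reward_a, done_a = step_a(state_a, action)
            state_b, reward_b, done_b = step_b(state_b, action)
            if (state_a, reward_a, done_a) != (state_b, reward_b, done_b):
                return False      # first divergence is enough to reject
            if done_a:
                break
    return True
```

Calling `rollout_equivalent(reference_step, translated_step, (0, 0))` compares the two toy implementations from the same initial state. Exact equality is used here because the toy state is integer-valued; a floating-point environment such as a physics simulator would presumably need a tolerance instead.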
