Automatic Generation of High-Performance RL Environments

Abstract

Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a reusable recipe (a generic prompt template, hierarchical verification, and iterative agent-assisted repair) that produces semantically equivalent high-performance environments for under $10 in compute cost. We demonstrate three distinct workflows across five environments. Direct translation (no prior high-performance implementation exists): EmuRust (1.5x PPO speedup via Rust parallelism for a Game Boy emulator) and PokeJAX, the first GPU-parallel Pokemon battle simulator, which reaches 500M steps per second (SPS) with random actions and 15.2M SPS under PPO, 22,320x over the TypeScript reference. Translation verified against existing high-performance implementations: throughput parity with MJX (1.04x) and 5x over Brax at matched GPU batch sizes (HalfCheetah JAX); 42x PPO (Puffer Pong). New environment creation: TCGJax, the first deployable JAX Pokemon TCG engine (717K SPS with random actions, 153K SPS under PPO; 6.6x over the Python reference), synthesized from a web-extracted specification. At 200M parameters, environment overhead drops below 4% of training time. Hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence for all five environments, and cross-backend policy transfer confirms zero sim-to-sim gap for all five. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for concerns about agent pretraining data. The paper contains sufficient detail, including representative prompts, verification methodology, and complete results, that a coding agent could reproduce the translations directly from the manuscript.
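To make the rollout level of the hierarchical verification concrete, the following is a minimal sketch, not the paper's actual test harness. It assumes both implementations can be driven as pure step functions and compares them transition by transition under identical seeded action sequences. The toy environments and the names `reference_step`, `translated_step`, `rollout_mismatches`, and `verify` are hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical toy environments, not taken from the paper. Each one is a pure
# step function: (state, action) -> (next_state, reward, done).
State = int
Transition = Tuple[State, float, bool]
StepFn = Callable[[State, int], Transition]


def reference_step(state: State, action: int) -> Transition:
    """Stand-in for a slow reference implementation (branching logic)."""
    next_state = state + (1 if action == 1 else -1)
    done = abs(next_state) >= 5
    reward = 1.0 if next_state >= 5 else 0.0
    return next_state, reward, done


def translated_step(state: State, action: int) -> Transition:
    """Stand-in for a translated implementation (branch-free arithmetic)."""
    next_state = state + 2 * action - 1
    done = abs(next_state) >= 5
    reward = float(next_state >= 5)
    return next_state, reward, done


def rollout_mismatches(ref: StepFn, new: StepFn, seed: int,
                       max_steps: int = 200) -> List[int]:
    """Drive both implementations with one seeded action sequence.

    Returns the step index of the first disagreement, or an empty list
    if every transition matched until termination or max_steps.
    """
    rng = random.Random(seed)
    s_ref = s_new = 0
    for t in range(max_steps):
        action = rng.randint(0, 1)
        out_ref = ref(s_ref, action)
        out_new = new(s_new, action)
        if out_ref != out_new:
            return [t]  # states have diverged; later steps are not comparable
        s_ref, _, done = out_ref
        s_new = out_new[0]
        if done:
            break
    return []


def verify(ref: StepFn, new: StepFn, num_seeds: int = 100) -> bool:
    """Rollout-level check: every seeded rollout must match exactly."""
    return all(not rollout_mismatches(ref, new, seed)
               for seed in range(num_seeds))
```

In this sketch, `verify(reference_step, translated_step)` passes only if every compared transition is identical. A real harness would presumably also need tolerance handling for floating-point observations and a way to synchronize random number generation across backends, neither of which is shown here.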
