Multi-Turn Code Generation Through Single-Step Rewards

Abstract

We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, muCode, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP (Markov Decision Process), where the correct code can be recovered from any intermediate code state in a single turn. muCode iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of muCode at utilizing the execution feedback. Our code is available at https://github.com/portal-cornell/muCode.
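The interaction between a generator, a verifier, and execution feedback can be illustrated with a toy sketch. This is not the muCode implementation. Every name below (`run_tests`, `multi_turn_generate`, `toy_generator`, `toy_verifier`) is hypothetical, the generator and verifier are hand-written stand-ins for learned models, and training is omitted entirely. Picking the highest-scoring candidate at each turn is an assumption made here for illustration.

```python
from typing import Callable, List, Tuple

# A test case is (positional arguments, expected return value).
TestCase = Tuple[tuple, object]
# History holds (code, execution feedback) pairs from earlier turns.
History = List[Tuple[str, str]]


def run_tests(code: str, tests: List[TestCase], func_name: str = "solve") -> Tuple[bool, str]:
    """Execute candidate code and return (all_passed, feedback_text)."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # toy setting only: never exec untrusted code unsandboxed
        func = namespace[func_name]
        for args, expected in tests:
            got = func(*args)
            if got != expected:
                return False, f"{func_name}{args} returned {got!r}, expected {expected!r}"
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "all public tests passed"


def multi_turn_generate(
    generator: Callable[[str, History, int], List[str]],
    verifier: Callable[[str, str], float],
    problem: str,
    tests: List[TestCase],
    max_turns: int = 3,
    n_samples: int = 2,
) -> Tuple[str, int]:
    """Return (final_code, turns_used) after at most max_turns rounds of feedback."""
    history: History = []
    code = ""
    for turn in range(1, max_turns + 1):
        # The generator is conditioned on the problem and all earlier feedback.
        candidates = generator(problem, history, n_samples)
        # Assumption: keep the candidate the verifier scores highest.
        code = max(candidates, key=lambda c: verifier(problem, c))
        passed, feedback = run_tests(code, tests)
        if passed:
            return code, turn
        history.append((code, feedback))
    return code, max_turns


def toy_generator(problem: str, history: History, n_samples: int) -> List[str]:
    """Hypothetical stand-in for a learned generator: wrong until it sees feedback."""
    if not history:
        pool = ["def solve(a, b):\n    return a - b\n", "def solve(a, b):\n    return a * b\n"]
    else:
        pool = ["def solve(a, b):\n    return a + b\n", "def solve(a, b):\n    return a - b\n"]
    return pool[:n_samples]


def toy_verifier(problem: str, code: str) -> float:
    """Hypothetical stand-in for a learned verifier: a fixed heuristic score."""
    if "a + b" in code:
        return 1.0
    if "a - b" in code:
        return 0.5
    return 0.0


if __name__ == "__main__":
    public_tests = [((1, 2), 3), ((2, 2), 4)]
    final_code, turns = multi_turn_generate(
        toy_generator, toy_verifier, "add two numbers", public_tests
    )
    print(f"solved in {turns} turn(s):\n{final_code}")
```

In this toy run the first turn's best-scored candidate fails the public tests, the failure message is appended to the history, and the second turn recovers a passing solution. That single-turn recovery from a faulty intermediate state is the kind of behavior the one-step recoverability assumption describes.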
