ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning

Abstract

While Large Language Models (LLMs) have revolutionized code generation, standard "System 1" approaches, which generate a solution in a single forward pass, often hit a performance ceiling on complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on external oracles, execution feedback, or computationally expensive prompt-response cycles. In this work, we propose ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes a structured reasoning trajectory, encompassing initial generation, bug- and optimization-aware reflection, and self-correction, directly into the model's weights. Unlike prior methods, ReflexiCoder shifts the paradigm from externally dependent refinement to an intrinsic, fully autonomous self-reflection and self-correction capability at inference time. We use an RL-zero training paradigm with granular reward functions to optimize the entire reflection-correction trajectory, teaching the model to debug without relying on ground-truth feedback or execution engines at inference time. Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder-8B establishes a new state of the art (SOTA) among leading open-source models in the 1.5B-14B range. In a single-attempt setting, it achieves 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces, rivaling or surpassing proprietary models such as GPT-5.1. Notably, our framework is significantly more token-efficient than base models, reducing inference-time compute overhead by approximately 40% through disciplined, high-speed reasoning and reflection patterns. Source code is available at https://github.com/juyongjiang/ReflexiCoder.
