ChatPaper.ai


ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning

March 6, 2026
Authors: Juyong Jiang, Jiasi Shen, Sunghun Kim, Kang Min Yoo, Jeonghoon Kim, Sungju Kim
cs.AI

Abstract

While Large Language Models (LLMs) have revolutionized code generation, standard "System 1" approaches, which generate a solution in a single forward pass, often hit a performance ceiling on complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on external oracles, execution feedback, or computationally expensive prompt-response cycles. In this work, we propose ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory, encompassing initial generation, bug- and optimization-aware reflection, and self-correction, directly into the model's weights. Unlike prior methods, ReflexiCoder shifts the paradigm from externally dependent refinement to an intrinsic, fully autonomous self-reflection and self-correction capability at inference time. We utilize an RL-zero training paradigm with granular reward functions to optimize the entire reflection-correction trajectory, teaching the model how to debug without relying on ground-truth feedback or execution engines at inference time. Extensive experiments across seven benchmarks demonstrate that ReflexiCoder-8B establishes a new state of the art (SOTA) among leading open-source models in the 1.5B-14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces in a single-attempt setting, rivaling or surpassing proprietary models such as GPT-5.1. Notably, our framework is significantly more token-efficient than base models, reducing inference-time compute overhead by approximately 40% through disciplined, high-speed reasoning and reflection patterns. Source code is available at https://github.com/juyongjiang/ReflexiCoder.
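To make the idea of a "granular reward over the reflection-correction trajectory" concrete, the following is a minimal, illustrative sketch, not the paper's actual implementation: the tag names (`<generate>`, `<reflect>`, `<correct>`), the reward weights, and the helper functions are all hypothetical assumptions, since the abstract does not specify the prompt format or reward terms.

```python
import re

# Hypothetical segment tags for one rollout; the paper's real trajectory
# format is not given in the abstract.
SEGMENTS = ("generate", "reflect", "correct")

def parse_trajectory(text):
    """Split a model rollout into its generate/reflect/correct segments."""
    parts = {}
    for tag in SEGMENTS:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        parts[tag] = m.group(1).strip() if m else None
    return parts

def granular_reward(parts, tests_passed, tests_total):
    """Toy per-segment reward: a small format-shaping bonus for emitting
    each segment, plus the test pass rate of the corrected code.
    Weights (0.1 / 0.7) are illustrative, not from the paper."""
    reward = 0.0
    reward += 0.1 * sum(parts[t] is not None for t in SEGMENTS)
    if parts["correct"] is not None and tests_total:
        reward += 0.7 * (tests_passed / tests_total)
    return reward

rollout = (
    "<generate>def add(a, b): return a - b</generate>"
    "<reflect>The operator is wrong: subtraction instead of addition.</reflect>"
    "<correct>def add(a, b): return a + b</correct>"
)
parts = parse_trajectory(rollout)
print(round(granular_reward(parts, tests_passed=5, tests_total=5), 2))  # 1.0
```

During training such a scalar would score each sampled trajectory for the RL objective; at inference time, by contrast, the paper's point is that no tests or execution engine are consulted at all, as the reflection and correction behavior has been internalized into the weights.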
PDF · March 12, 2026