Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation
February 3, 2026
Authors: Ziru Chen, Dongdong Chen, Ruinan Jin, Yingbin Liang, Yujia Xie, Huan Sun
cs.AI
Abstract
Recently, there has been significant research interest in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks, such as multi-turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinder wide adoption. In this paper, we build on the observation that multi-turn code generation can be formulated as a one-step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial trajectory prompt through single-step code generation. Cobalt outperforms two multi-turn online RL baselines based on GRPO and VeRPO, and substantially improves R1-Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 points on LiveCodeBench. Also, we analyze LLMs' in-context reward hacking behaviors and augment Cobalt training with perturbed trajectories to mitigate this issue. Overall, our results demonstrate Cobalt as a promising solution for iterative decision-making tasks like multi-turn code generation. Our code and data are available at https://github.com/OSU-NLP-Group/cobalt.
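To make the trajectory-splitting step concrete, here is a minimal sketch in Python, assuming a trajectory alternates between a problem statement, code attempts, and execution feedback. All names (`Turn`, `split_into_contextual_prompts`, the role strings) are illustrative assumptions, not the authors' released implementation; it only shows how an offline trajectory could be cut into partial-trajectory prompts, each ending right before a code turn so the LLM completes it with a single generation step.

```python
# A minimal sketch of Cobalt's trajectory-splitting idea. All names here
# are hypothetical and not drawn from the paper's code.
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "problem", "code", or "feedback"
    content: str

def split_into_contextual_prompts(trajectory: list[Turn]) -> list[str]:
    """Cut an offline multi-turn trajectory into partial-trajectory prompts.

    Each prompt ends just before a code turn, so that during online bandit
    learning the LLM completes it with one code-generation step whose
    reward (e.g., a unit-test pass signal) is observed immediately.
    """
    prompts = []
    for i, turn in enumerate(trajectory):
        if turn.role == "code":
            # Everything before this code turn becomes one contextual prompt.
            context = "\n\n".join(t.content for t in trajectory[:i])
            prompts.append(context)
    return prompts

# Example: a problem turn, a failed attempt with feedback, then a fix.
traj = [
    Turn("problem", "Write a function that reverses a string."),
    Turn("code", "def rev(s): return s[:-1]"),
    Turn("feedback", "Test failed: rev('ab') returned 'a', expected 'ba'."),
    Turn("code", "def rev(s): return s[::-1]"),
]
print(len(split_into_contextual_prompts(traj)))  # 2 partial prompts
```

Under this framing, each partial prompt becomes one contextual-bandit instance: the context is fixed offline, only the single completion is sampled online, which is what lets Cobalt avoid the cost and instability of full multi-turn online rollouts.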