Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation

February 3, 2026
作者: Ziru Chen, Dongdong Chen, Ruinan Jin, Yingbin Liang, Yujia Xie, Huan Sun
cs.AI

Abstract

Recently, there has been significant research interest in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks such as multi-turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinder wide adoption. In this paper, we build on the observation that multi-turn code generation can be formulated as a one-step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories that serve as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial trajectory prompt through single-step code generation. Cobalt outperforms two multi-turn online RL baselines based on GRPO and VeRPO, and substantially improves R1-Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 points on LiveCodeBench. We also analyze LLMs' in-context reward hacking behaviors and augment Cobalt training with perturbed trajectories to mitigate this issue. Overall, our results demonstrate that Cobalt is a promising solution for iterative decision-making tasks like multi-turn code generation. Our code and data are available at https://github.com/OSU-NLP-Group/cobalt.
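
To make the data-preparation step concrete, below is a minimal Python sketch of how an offline multi-turn trajectory could be split into partial-trajectory contextual prompts for single-step (bandit) completion, as the abstract describes. This is not code from the paper's repository; the `Turn` structure, the prompt template, and the function name `partial_trajectory_prompts` are all hypothetical illustrations.

```python
# Hypothetical sketch of Cobalt-style data preparation: split one offline
# multi-turn code generation trajectory into partial trajectories, each
# serving as a contextual prompt for one-step completion.

from dataclasses import dataclass


@dataclass
class Turn:
    code: str      # model-generated code attempt for this turn
    feedback: str  # execution/test feedback observed after the attempt


def partial_trajectory_prompts(problem: str, turns: list[Turn]) -> list[str]:
    """Build one contextual prompt per prefix of the trajectory.

    Prompt k contains the problem plus the first k (code, feedback)
    pairs; during online bandit learning the LLM is trained to produce
    the next code attempt in a single step from each such prompt.
    """
    prompts = []
    for k in range(len(turns) + 1):
        context = problem
        for turn in turns[:k]:
            context += f"\n\n[Attempt]\n{turn.code}\n[Feedback]\n{turn.feedback}"
        prompts.append(context + "\n\n[Next attempt]\n")
    return prompts


# Example: a 1-turn trajectory yields 2 prompts
# (empty context, and context containing turn 1).
demo = partial_trajectory_prompts(
    "Write a function that returns the n-th Fibonacci number.",
    [Turn(code="def fib(n): ...", feedback="Wrong answer on n=0")],
)
print(len(demo))  # 2
```

Under this reading, each prompt is completed once and scored (e.g., by unit tests), so the inner loop is a single-step generation rather than a full multi-turn rollout, which is where the training-cost savings over multi-turn online RL would come from.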