From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models
April 13, 2026
Author: Chenchen Zhang
cs.AI
Abstract
Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards -- yet determining which actions within a long trajectory caused the outcome remains difficult. This credit assignment (CA) problem manifests in two regimes: reasoning RL, where credit must be distributed across tokens and steps within a single chain-of-thought generation (500--30K+ tokens); and agentic RL, where multi-turn environment interaction introduces stochastic transitions, partial observability, and horizons of 100+ turns (100K--1M tokens), making episode-level credit increasingly uninformative.
We survey 47 CA methods (41 core, 6 adjacent enablers) published between 2024 and early 2026, organizing them in a two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic). Beyond the survey itself, we contribute three reusable resources: (1) a structured, machine-readable paper inventory with taxonomy labels, baseline families, and evidence levels; (2) a reporting checklist for future CA papers, validated against the reviewed literature to identify systematic methodological gaps; and (3) a benchmark protocol specification with task families, metadata requirements, and controlled bifurcation tasks, accompanied by a method selection decision tree.
Our synthesis suggests that the shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape: reasoning CA is maturing around process reward models and critic-free group comparison, while agentic CA is driving genuinely new approaches -- hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations -- that have no direct precedent in reasoning RL.
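To make the "critic-free group comparison" family concrete: a minimal, illustrative sketch (not code from the survey) of GRPO-style credit assignment, where each sampled completion's sparse outcome reward is normalized against the statistics of its sampling group, so no learned value critic is needed. The function name and the 0/1 reward scheme are assumptions for illustration only.

```python
# Illustrative sketch (hypothetical helper, not from the survey):
# critic-free group comparison in the GRPO style. Credit for each
# completion is its reward standardized against the group's mean and
# standard deviation, replacing a learned value critic as the baseline.

def group_relative_advantages(rewards, eps=1e-8):
    """Map a group of outcome-level rewards to per-sample advantages."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against a degenerate group where all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four completions of one prompt with sparse 0/1 outcome rewards.
# Successful completions receive positive credit, failed ones negative.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In practice the resulting advantage is broadcast uniformly to every token of the completion, which is precisely why such methods assign only coarse, episode-level credit and motivate the finer-grained (token, segment, step, turn) methods the survey catalogs.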