From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models
April 13, 2026
Author: Chenchen Zhang
cs.AI
Abstract
Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards -- yet determining which actions within a long trajectory caused the outcome remains difficult. This credit assignment (CA) problem manifests in two regimes: reasoning RL, where credit must be distributed across tokens and steps within a single chain-of-thought generation (500--30K+ tokens); and agentic RL, where multi-turn environment interaction introduces stochastic transitions, partial observability, and horizons of 100+ turns (100K--1M tokens), making episode-level credit increasingly uninformative.
We survey 47 CA methods (41 core, 6 adjacent enablers) published between 2024 and early 2026, organizing them in a two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic). Beyond the survey itself, we contribute three reusable resources: (1) a structured, machine-readable paper inventory with taxonomy labels, baseline families, and evidence levels; (2) a reporting checklist for future CA papers, validated against the reviewed literature to identify systematic methodological gaps; and (3) a benchmark protocol specification with task families, metadata requirements, and controlled bifurcation tasks, accompanied by a method selection decision tree.
Our synthesis suggests that the shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape: reasoning CA is maturing around process reward models and critic-free group comparison, while agentic CA is driving genuinely new approaches -- hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations -- that have no direct precedent in reasoning RL.
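To make the "critic-free group comparison" family concrete: a minimal, illustrative sketch (not code from the survey) of GRPO-style credit assignment, where each sampled completion's sparse outcome reward is normalized against the statistics of its sampling group, so no learned value critic is needed. The function name and the 0/1 reward scheme are assumptions for illustration only.

```python
# Illustrative sketch (hypothetical helper, not from the survey):
# critic-free group comparison in the GRPO style. Credit for each
# completion is its reward standardized against the group's mean and
# standard deviation, replacing a learned value critic as the baseline.

def group_relative_advantages(rewards, eps=1e-8):
    """Map a group of outcome-level rewards to per-sample advantages."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against a degenerate group where all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four completions of one prompt with sparse 0/1 outcome rewards.
# Successful completions receive positive credit, failed ones negative.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In practice the resulting advantage is broadcast uniformly to every token of the completion, which is precisely why such methods assign only coarse, episode-level credit and motivate the finer-grained (token, segment, step, turn) methods the survey catalogs.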