Agentic Critical Training
March 9, 2026
Authors: Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, Furong Huang
cs.AI
Abstract
Training large language models (LLMs) as autonomous agents often begins with imitation learning, but it only teaches agents what to do without understanding why: agents never contrast successful actions against suboptimal alternatives and thus lack awareness of action quality. Recent approaches attempt to address this by introducing self-reflection supervision derived from contrasts between expert and alternative actions. However, the training paradigm fundamentally remains imitation learning: the model imitates pre-constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. By rewarding whether the model's judgment is correct, ACT drives the model to autonomously develop reasoning about action quality, producing genuine self-reflection rather than imitating it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post-training methods. It achieves an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared to approaches that inject reflection capability through knowledge distillation, ACT also demonstrates clear advantages, yielding an average improvement of 2.42 points. Moreover, ACT enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data, highlighting the value of our method. These results suggest that ACT is a promising path toward developing more reflective and capable LLM agents.
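The core training signal described above can be sketched concretely. The minimal sketch below is illustrative only: it assumes a pairwise setup in which an expert action is shuffled with a suboptimal alternative, the agent must judge which is better, and a binary reward scores that judgment. The abstract does not specify the exact reward shaping or data construction, so the function names and the 0/1 reward are assumptions.

```python
import random

def act_reward(judgment: str, better_action_label: str) -> float:
    """Binary reward for an ACT-style critique step: 1.0 if the model's
    judgment names the truly better action, else 0.0.
    (Illustrative; the paper's exact reward design is not given here.)"""
    return 1.0 if judgment == better_action_label else 0.0

def make_contrast_example(expert_action: str, alternative_action: str):
    """Build one training instance: shuffle the expert action and a
    suboptimal alternative so the agent cannot exploit position bias.
    Returns the labeled candidates and the gold label of the expert action."""
    candidates = [("A", expert_action), ("B", alternative_action)]
    random.shuffle(candidates)
    labeled = {label: action for label, action in candidates}
    gold = next(label for label, action in candidates
                if action == expert_action)
    return labeled, gold

# Toy rollout with hypothetical web-agent actions.
labeled, gold = make_contrast_example(
    expert_action="click(search_button)",
    alternative_action="scroll(down)",
)
model_judgment = gold  # stand-in for the model's sampled judgment
print(act_reward(model_judgment, gold))  # 1.0
```

Under this framing, an RL optimizer (e.g. a policy-gradient method) would maximize the expected reward of the model's generated judgments, pushing it to reason about action quality rather than to reproduce pre-written reflection text.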