

Agentic Critical Training

March 9, 2026
Authors: Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, Furong Huang
cs.AI

Abstract

Training large language models (LLMs) as autonomous agents often begins with imitation learning, but it only teaches agents what to do without understanding why: agents never contrast successful actions against suboptimal alternatives and thus lack awareness of action quality. Recent approaches attempt to address this by introducing self-reflection supervision derived from contrasts between expert and alternative actions. However, the training paradigm fundamentally remains imitation learning: the model imitates pre-constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. By rewarding whether the model's judgment is correct, ACT drives the model to autonomously develop reasoning about action quality, producing genuine self-reflection rather than imitating it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post-training methods. It achieves an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared to approaches that inject reflection capability through knowledge distillation, ACT also demonstrates clear advantages, yielding an average improvement of 2.42 points. Moreover, ACT enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data, highlighting the value of our method. These results suggest that ACT is a promising path toward developing more reflective and capable LLM agents.
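The abstract describes ACT's core signal: the agent is shown alternative actions, asked to judge which is better, and rewarded only when its judgment is correct. A minimal sketch of that binary reward is below; all names (`act_reward`, `build_critique_prompt`, the action strings) are illustrative assumptions, not the paper's actual API or prompt format.

```python
# Hypothetical sketch of an ACT-style reward: the model compares an expert
# action against a suboptimal alternative and is rewarded for picking the
# better one. Names and prompt wording are assumptions for illustration.

def act_reward(model_choice: str, expert_action: str) -> float:
    """Binary reward: 1.0 if the model judged the expert action better, else 0.0."""
    return 1.0 if model_choice == expert_action else 0.0


def build_critique_prompt(state: str, action_a: str, action_b: str) -> str:
    """Format a comparison query; the model must reason, then answer with an action."""
    return (
        f"State: {state}\n"
        f"Action A: {action_a}\n"
        f"Action B: {action_b}\n"
        "Which action is better? Reason step by step, then name the better action."
    )


# Example episode: the expert action is to submit the form; the alternative
# scrolls away. A correct judgment earns reward 1.0, an incorrect one 0.0.
prompt = build_critique_prompt("checkout page", "click_submit", "scroll_down")
reward_correct = act_reward("click_submit", expert_action="click_submit")
reward_wrong = act_reward("scroll_down", expert_action="click_submit")
```

Because the supervision is only the correctness of the final judgment (not pre-written reflection text), the model is free to develop its own reasoning about action quality, which is the paradigm shift the abstract emphasizes over imitation-style reflection training.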