Agentic Critical Training

Abstract

Training large language models (LLMs) as autonomous agents often begins with imitation learning, but imitation teaches agents only what to do, not why: agents never contrast successful actions against suboptimal alternatives and thus develop no awareness of action quality. Recent approaches attempt to address this by introducing self-reflection supervision derived from contrasts between expert and alternative actions. However, the training paradigm fundamentally remains imitation learning: the model imitates pre-constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. By rewarding the model according to whether its judgment is correct, ACT drives the model to develop its own reasoning about action quality, producing genuine self-reflection rather than an imitation of it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post-training methods, with average improvements of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared with approaches that inject reflection capability through knowledge distillation, ACT yields an average improvement of 2.42 points. Moreover, ACT enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data. These results suggest that ACT is a promising path toward developing more reflective and capable LLM agents.
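
The correctness-based reward described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the prompt format, the number of alternatives, or the reward values, so everything here is an assumption made for illustration: a pairwise comparison between one expert action and one alternative, a hypothetical `Answer: A` / `Answer: B` output convention, and a binary 1.0 / 0.0 reward. The names `JudgmentExample`, `build_example`, and `judgment_reward` are hypothetical and not taken from the paper.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class JudgmentExample:
    """One comparison task (hypothetical structure, not from the paper)."""
    prompt: str   # text shown to the model
    correct: str  # label of the better (expert) action: "A" or "B"


def build_example(state: str, expert_action: str, alternative_action: str,
                  rng: random.Random) -> JudgmentExample:
    """Place the expert action at a random position so order gives no hint."""
    expert_first = rng.random() < 0.5
    if expert_first:
        a, b = expert_action, alternative_action
    else:
        a, b = alternative_action, expert_action
    prompt = (
        f"State:\n{state}\n\n"
        f"Action A: {a}\nAction B: {b}\n\n"
        "Reason about which action is better, then finish with "
        "'Answer: A' or 'Answer: B'."
    )
    return JudgmentExample(prompt=prompt, correct="A" if expert_first else "B")


# Assumed output convention: the model states its choice as "Answer: A" or "Answer: B".
_ANSWER = re.compile(r"Answer:\s*([AB])\b")


def judgment_reward(response: str, example: JudgmentExample) -> float:
    """Return 1.0 if the last stated answer picks the better action, else 0.0."""
    matches = _ANSWER.findall(response)
    if not matches:
        return 0.0  # unparseable responses earn nothing (assumed convention)
    return 1.0 if matches[-1] == example.correct else 0.0
```

Under these assumptions, only the final verdict is scored and the reasoning text that precedes it is left unconstrained, which matches the abstract's point that the model is rewarded for a correct judgment rather than for reproducing pre-written reflection text. A scalar reward of this kind could be passed to a policy-gradient style trainer; the abstract does not say which reinforcement learning algorithm is used.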