Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization via Reinforcement Learning

Abstract

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents. Once an LLM-based AI agent is deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, so that its performance on future tasks, conditioned on the updated context, improves progressively. The CoD framework has two major components: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences that interleave solve-task and update-context episodes; and (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, and tasks and environments tailored to the targeted meta-capability rather than to domain-specific LLM capabilities or standard task-by-task RL. Empirical results validate the efficacy of end-to-end RL training in the CoD setting and indicate that the elicited meta-capability has the potential to generalize out of distribution: within the training domains, across different domains, and from CoD to Ralph-loop settings. Our investigation of CoD connects several lines of prior work and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod.
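
To make the rollout structure concrete, the following is a minimal sketch of one long rollout that interleaves solve-task and update-context episodes, together with the standard GRPO-style group normalization of returns. It is an illustration under stated assumptions, not the released implementation: every name here (`Rollout`, `run_rollout`, `group_relative_advantages`, `toy_solve`, `toy_update`) is hypothetical, the toy environment stands in for real LLM calls, and the fine-grained credit assignment mentioned above is not modeled because the abstract does not specify it.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class Rollout:
    """One long rollout: alternating solve-task and update-context episodes."""
    episodes: List[str] = field(default_factory=list)   # episode kinds, in order
    rewards: List[float] = field(default_factory=list)  # one reward per solved task


def run_rollout(
    tasks: List[str],
    solve: Callable[[str, str], float],
    update: Callable[[str, str, float], str],
    context: str = "",
) -> Rollout:
    """Interleave solve-task and update-context episodes over a task sequence."""
    rollout = Rollout()
    for task in tasks:
        reward = solve(task, context)            # solve-task episode
        rollout.episodes.append("solve")
        rollout.rewards.append(reward)
        context = update(context, task, reward)  # update-context episode
        rollout.episodes.append("update")
    return rollout


def group_relative_advantages(returns: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style baseline: normalize each return against its rollout group."""
    mu, sigma = mean(returns), pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


# Hypothetical toy environment: a task succeeds only if the agent has
# already recorded it in its context during an earlier update episode.
def toy_solve(task: str, context: str) -> float:
    return 1.0 if task in context else 0.0


def toy_update(context: str, task: str, reward: float) -> str:
    return context + task


if __name__ == "__main__":
    rollout = run_rollout(["a", "a", "b", "b"], toy_solve, toy_update)
    print(rollout.rewards)
    print(group_relative_advantages([sum(rollout.rewards), 0.0]))
```

In this toy setting, a repeated task is solved only after the context has been updated with it, which mirrors the idea that later tasks benefit from earlier context updates. In an actual training run, the group passed to `group_relative_advantages` would presumably contain the returns of several rollouts sampled for the same task sequence.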