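Connect the Dots: Training Large Language Models via Reinforcement Learning for Long-Lifecycle Agents with Cross-Domain Generalization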

Abstract

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: once an LLM-based AI agent is deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experience, and iteratively updating its own context about the environment, so that it performs progressively better on future tasks conditioned on the updated context. The major components of the CoD framework are: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) over long rollout sequences that interleave solve-task and update-context episodes; (2) tasks and environments that incentivize and elicit the targeted meta-capability in LLMs during training, and that faithfully measure progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than to domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting and demonstrate the potential of the elicited meta-capability for out-of-distribution generalization: within the training domains, across different domains, and from CoD to Ralph-loop settings. Our investigation of CoD connects several lines of prior work and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod.
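
To make the rollout structure concrete, the following is a minimal sketch of a CoD-style rollout, in which solve-task and update-context episodes alternate over a sequence of tasks. It is not taken from the released code: the names `Rollout`, `run_cod_rollout`, `toy_solve`, `toy_update`, and `group_relative_advantages` are hypothetical, the toy environment stands in for an LLM agent, and normalizing rewards per task position across a group of rollouts is only one assumed reading of "GRPO-style RL with fine-grained credit assignment".

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, List, Tuple


@dataclass
class Rollout:
    """One long rollout: the evolving context plus one reward per task."""
    context: str = ""
    rewards: List[float] = field(default_factory=list)


def run_cod_rollout(
    tasks: List[str],
    solve_task: Callable[[str, str], Tuple[str, float]],
    update_context: Callable[[str, str, str, float], str],
) -> Rollout:
    """Interleave solve-task and update-context episodes (hypothetical API)."""
    rollout = Rollout()
    for task in tasks:
        # Solve-task episode: act conditioned on the current context.
        trajectory, reward = solve_task(task, rollout.context)
        rollout.rewards.append(reward)
        # Update-context episode: rewrite the context from the new experience.
        rollout.context = update_context(rollout.context, task, trajectory, reward)
    return rollout


def group_relative_advantages(group_rewards: List[List[float]]) -> List[List[float]]:
    """Assumed credit assignment: normalize rewards at each task position
    across a group of rollouts, in the spirit of GRPO's group baseline."""
    n_tasks = len(group_rewards[0])
    advantages = [[0.0] * n_tasks for _ in group_rewards]
    for t in range(n_tasks):
        column = [rewards[t] for rewards in group_rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, r in enumerate(column):
            advantages[i][t] = (r - mu) / (sigma + 1e-6)
    return advantages


# Toy stand-ins for the LLM: a task succeeds only if it was seen before.
def toy_solve(task: str, context: str) -> Tuple[str, float]:
    return f"tried {task}", 1.0 if task in context.split() else 0.0


def toy_update(context: str, task: str, trajectory: str, reward: float) -> str:
    return context if task in context.split() else (context + " " + task).strip()


rollout = run_cod_rollout(["a", "b", "a"], toy_solve, toy_update)
```

In this toy run the third task repeats the first, so it benefits from the context written after the first episode. That is the behavior the CoD setting is meant to reward: later tasks improve because earlier experience was recorded in the context.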