Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization via Reinforcement Learning

Abstract

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: once an LLM-based AI agent is deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experience, and iteratively self-updating its context about the environment, so that it performs progressively better on future tasks conditioned on the updated context. The CoD framework has two major components: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) over long rollout sequences that interleave solve-task and update-context episodes; and (2) tasks and environments that incentivize and elicit the targeted meta-capability in LLMs during training and faithfully measure progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, together with tasks and environments tailored to the targeted meta-capability (rather than to domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting and show that the elicited meta-capability has potential for out-of-distribution generalization: within the training domains, across different domains, and from CoD to Ralph-loop settings. Our investigation of CoD connects several lines of prior work and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, our implementations are available at: https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod
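To make the rollout structure described above concrete, the following is a minimal Python sketch of one CoD-style rollout in which solve-task and update-context episodes alternate, followed by a standard GRPO-style group-normalized advantage. It is an illustration under stated assumptions, not the released implementation: every name here (`run_cod_rollout`, `solve_task`, `update_context`, `group_relative_advantages`, and the toy environment) is hypothetical, and the fine-grained credit assignment mentioned in the abstract is not specified there and is not reproduced.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple


@dataclass
class Rollout:
    """Record of one long rollout (hypothetical structure, for illustration)."""
    rewards: List[float] = field(default_factory=list)  # one per solve-task episode
    context: str = ""  # the agent's self-maintained context about the environment


def run_cod_rollout(
    tasks: Sequence[str],
    solve_task: Callable[[str, str], Tuple[str, float]],
    update_context: Callable[[str, str, str, float], str],
    context: str = "",
) -> Rollout:
    """Alternate solve-task and update-context episodes over a task sequence."""
    rollout = Rollout(context=context)
    for task in tasks:
        # Solve-task episode: act on the task, conditioned on the current context.
        trajectory, reward = solve_task(task, rollout.context)
        rollout.rewards.append(reward)
        # Update-context episode: rewrite the context using the new experience.
        rollout.context = update_context(rollout.context, task, trajectory, reward)
    return rollout


def group_relative_advantages(returns: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO-style normalization of returns within a group of rollouts.

    Assumption: this is the plain group-level form, not the paper's
    fine-grained credit assignment.
    """
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


# Toy environment (hypothetical): a task succeeds only if the context
# already holds a note about it, so later attempts benefit from earlier ones.
HIDDEN = {"a": 1, "b": 2}


def toy_solve(task: str, context: str) -> Tuple[str, float]:
    reward = 1.0 if f"{task}=" in context else 0.0
    return f"tried {task}", reward


def toy_update(context: str, task: str, trajectory: str, reward: float) -> str:
    note = f"{task}={HIDDEN[task]};"
    return context if note in context else context + note


rollout = run_cod_rollout(["a", "b", "a", "b"], toy_solve, toy_update)
# rollout.rewards is [0.0, 0.0, 1.0, 1.0]: the second pass over each task succeeds
advantages = group_relative_advantages([sum(rollout.rewards), 0.0])
```

In this toy setting the reward on a task improves only because an earlier update-context episode wrote the relevant note, which is the dependency that end-to-end training over the whole sequence is meant to reward.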