Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
April 22, 2026
Authors: Xiyang Wu, Zongxia Li, Guangyao Shi, Alexander Duffy, Tyler Marques, Matthew Lyle Olson, Tianyi Zhou, Dinesh Manocha
cs.AI
Abstract
Long-horizon interactive environments are a testbed for evaluating agents' skill usage. They demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision making under delayed rewards and partial observability, which makes games a natural testbed for agent skill usage. Large Language Models (LLMs) show promise as game-playing agents, but they often struggle with consistent long-horizon decision making because they lack a mechanism to discover, retain, and reuse structured skills across episodes. We present COSPLAY, a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent-managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to build that bank. The framework improves the decision agent's skill retrieval and action generation, while the skill bank agent continually extracts, refines, and updates skills together with their contracts. Experiments across six game environments show that COSPLAY with an 8B base model achieves a 25.1% average reward improvement over four frontier LLM baselines on single-player game benchmarks while remaining competitive on multi-player social reasoning games.
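The retrieve-from-bank and discover-from-rollouts loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Skill`, `SkillBank`, and keyword-overlap retrieval below are hypothetical stand-ins for the learned retrieval and the agent-managed skill pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    contract: str          # when/effect description attached to the skill
    keywords: set[str]     # hypothetical retrieval index

@dataclass
class SkillBank:
    skills: list[Skill] = field(default_factory=list)

    def retrieve(self, observation: str, k: int = 2) -> list[Skill]:
        # Score skills by keyword overlap with the observation; a
        # stand-in for the learned skill retrieval in the paper.
        tokens = set(observation.lower().split())
        ranked = sorted(self.skills,
                        key=lambda s: len(s.keywords & tokens),
                        reverse=True)
        return ranked[:k]

    def update(self, candidate: Skill) -> None:
        # Refine an existing skill's index or add a new skill.
        for s in self.skills:
            if s.name == candidate.name:
                s.keywords |= candidate.keywords
                return
        self.skills.append(candidate)

def extract_skills(rollout: list[tuple[str, str]]) -> list[Skill]:
    # Toy skill discovery: treat each (observation, action) pair in an
    # unlabeled rollout as a candidate skill keyed on observation words.
    return [Skill(name=action,
                  contract=f"when: {obs}",
                  keywords=set(obs.lower().split()))
            for obs, action in rollout]

# One co-evolution step: the bank ingests skills mined from a rollout,
# then the decision agent retrieves from the updated bank.
bank = SkillBank()
rollout = [("door locked", "use key"), ("enemy near", "retreat")]
for cand in extract_skills(rollout):
    bank.update(cand)
top = bank.retrieve("the door is locked", k=1)
print(top[0].name)  # -> use key
```

In the full framework both sides would be LLM-driven and learned; the toy keyword index here only shows how retrieval quality and bank contents can improve together as rollouts accumulate.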