Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
April 22, 2026
Authors: Xiyang Wu, Zongxia Li, Guangyao Shi, Alexander Duffy, Tyler Marques, Matthew Lyle Olson, Tianyi Zhou, Dinesha Manocha
cs.AI
Abstract
Long-horizon interactive environments are a natural testbed for evaluating agents' ability to use skills: they demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision making under delayed rewards and partial observability. Games provide exactly such a testbed. Large Language Models (LLMs) have shown promise as game-playing agents, but they often struggle with consistent long-horizon decision making because they lack a mechanism to discover, retain, and reuse structured skills across episodes. We present COSPLAY, a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent-managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to build that bank. The framework improves the decision agent's skill retrieval and action generation, while the skill-bank agent continually extracts, refines, and updates skills together with their contracts. Experiments across six game environments show that COSPLAY with an 8B base model achieves a 25.1% average reward improvement over four frontier LLM baselines on single-player game benchmarks while remaining competitive on multi-player social reasoning games.
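To make the co-evolution loop concrete, the following is a minimal Python sketch of the cycle the abstract describes: the decision agent retrieves skills from the bank to condition its actions, and the skill-bank agent then mines the unlabeled rollout for new skills and contracts. All names here (Skill, SkillBank, format_prompt, the lexical-overlap retrieval) are hypothetical illustrations under our own assumptions, not the paper's actual interfaces; the two agents are passed in as opaque callables.

```python
# Hypothetical sketch of a COSPLAY-style co-evolution loop.
# Skill, SkillBank, and the helper functions are illustrative assumptions,
# not the paper's API.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Skill:
    name: str
    contract: str       # when the skill applies and what it promises
    instructions: str   # natural-language guidance injected into the prompt


def lexical_overlap(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity between two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))


@dataclass
class SkillBank:
    skills: list[Skill] = field(default_factory=list)

    def retrieve(self, observation: str, k: int = 3) -> list[Skill]:
        # Rank stored skills by relevance to the current observation.
        ranked = sorted(
            self.skills,
            key=lambda s: lexical_overlap(s.instructions, observation),
            reverse=True,
        )
        return ranked[:k]


def format_prompt(observation: str, skills: list[Skill]) -> str:
    """Inject retrieved skills (with their contracts) into the decision prompt."""
    skill_text = "\n".join(
        f"- {s.name} [{s.contract}]: {s.instructions}" for s in skills
    )
    return (
        f"Observation:\n{observation}\n\n"
        f"Relevant skills:\n{skill_text}\n\nNext action:"
    )


def run_episode(
    env,
    decision_llm: Callable[[str], str],
    extract_skills: Callable[[list], list[Skill]],
    bank: SkillBank,
) -> float:
    """One co-evolution step: act with retrieved skills, then let the
    skill-bank agent mine the unlabeled rollout for reusable skills."""
    rollout, total_reward = [], 0.0
    obs, done = env.reset(), False
    while not done:
        prompt = format_prompt(obs, bank.retrieve(obs))
        action = decision_llm(prompt)              # decision agent acts
        next_obs, reward, done = env.step(action)
        rollout.append((obs, action, reward))
        total_reward += reward
        obs = next_obs
    # Skill-bank agent: extract, refine, and update skills + contracts
    # from the unlabeled trajectory.
    bank.skills.extend(extract_skills(rollout))
    return total_reward
```

In this reading, "co-evolution" means both components improve from the same trajectories: the decision agent benefits from a richer bank at retrieval time, and the bank grows only from rollouts the decision agent actually produces.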