MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild
March 17, 2026
Authors: Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, Zeyu Zheng, Cihang Xie, Huaxiu Yao
cs.AI
Abstract
Large language model (LLM) agents are increasingly used for complex tasks, yet deployed agents often remain static, failing to adapt as user needs evolve. This creates a tension between the need for continuous service and the necessity of updating capabilities to match shifting task distributions. On platforms like OpenClaw, which handle diverse workloads across 20+ channels, existing methods either store raw trajectories without distilling knowledge, maintain static skill libraries, or require disruptive downtime for retraining. We present MetaClaw, a continual meta-learning framework that jointly evolves a base LLM policy and a library of reusable behavioral skills. MetaClaw employs two complementary mechanisms. Skill-driven fast adaptation analyzes failure trajectories via an LLM evolver to synthesize new skills, enabling immediate improvement with zero downtime. Opportunistic policy optimization performs gradient-based updates via cloud LoRA fine-tuning and Reinforcement Learning with a Process Reward Model (RL-PRM). This is triggered during user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors system inactivity and calendar data. These mechanisms are mutually reinforcing: a refined policy generates better trajectories for skill synthesis, while richer skills provide higher-quality data for policy optimization. To prevent data contamination, a versioning mechanism separates support and query data. Built on a proxy-based architecture, MetaClaw scales to production-size LLMs without local GPUs. Experiments on MetaClaw-Bench and AutoResearchClaw show that skill-driven adaptation improves accuracy by up to 32% relative. The full pipeline advances Kimi-K2.5 accuracy from 21.4% to 40.6% and increases composite robustness by 18.3%. Code is available at https://github.com/aiming-lab/MetaClaw.
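To make the scheduling idea concrete, here is a minimal sketch of an OMLS-style trigger. The class names, the 30-minute idle threshold, and the nightly quiet window are illustrative assumptions, not the paper's actual implementation (see the linked repository for that); the sketch only shows the stated logic of firing a policy-optimization job when system inactivity and calendar data both indicate a user-inactive window.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class ActivityMonitor:
    """Tracks the most recent request to estimate system idleness."""
    last_request: datetime
    idle_threshold_s: float = 1800.0  # assumption: 30 min without requests = idle

    def is_idle(self, now: datetime) -> bool:
        return (now - self.last_request).total_seconds() >= self.idle_threshold_s


@dataclass
class OpportunisticScheduler:
    """Hypothetical OMLS-style trigger: start a cloud LoRA fine-tuning job
    only when the system is idle AND the calendar shows a quiet window."""
    monitor: ActivityMonitor
    quiet_start: time = time(1, 0)  # assumed nightly quiet window 01:00-05:00
    quiet_end: time = time(5, 0)

    def in_quiet_window(self, now: datetime) -> bool:
        return self.quiet_start <= now.time() <= self.quiet_end

    def should_optimize(self, now: datetime) -> bool:
        # Both signals must agree before disrupting the serving policy.
        return self.monitor.is_idle(now) and self.in_quiet_window(now)
```

Gating on both signals matches the abstract's design: gradient updates never preempt live traffic, while the skill library handles fast adaptation with zero downtime in between.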