MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild
March 17, 2026
Authors: Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, Zeyu Zheng, Cihang Xie, Huaxiu Yao
cs.AI
Abstract
Large language model (LLM) agents are increasingly used for complex tasks, yet deployed agents often remain static, failing to adapt as user needs evolve. This creates a tension between the need for continuous service and the necessity of updating capabilities to match shifting task distributions. On platforms like OpenClaw, which handle diverse workloads across 20+ channels, existing methods either store raw trajectories without distilling knowledge, maintain static skill libraries, or require disruptive downtime for retraining. We present MetaClaw, a continual meta-learning framework that jointly evolves a base LLM policy and a library of reusable behavioral skills. MetaClaw employs two complementary mechanisms. Skill-driven fast adaptation analyzes failure trajectories via an LLM evolver to synthesize new skills, enabling immediate improvement with zero downtime. Opportunistic policy optimization performs gradient-based updates via cloud LoRA fine-tuning and Reinforcement Learning with a Process Reward Model (RL-PRM). This is triggered during user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors system inactivity and calendar data. These mechanisms are mutually reinforcing: a refined policy generates better trajectories for skill synthesis, while richer skills provide higher-quality data for policy optimization. To prevent data contamination, a versioning mechanism separates support and query data. Built on a proxy-based architecture, MetaClaw scales to production-size LLMs without local GPUs. Experiments on MetaClaw-Bench and AutoResearchClaw show that skill-driven adaptation improves accuracy by up to 32% relative. The full pipeline advances Kimi-K2.5 accuracy from 21.4% to 40.6% and increases composite robustness by 18.3%. Code is available at https://github.com/aiming-lab/MetaClaw.
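The abstract's three mechanisms can be illustrated schematically: an opportunistic scheduler that gates gradient updates on user inactivity and calendar data, skill synthesis from failure trajectories, and a versioning split that keeps support and query data separate. The sketch below is a minimal illustration of those ideas, not the authors' implementation; all class and function names (`OMLS`, `SkillLibrary`, `split_support_query`) and the idle-threshold parameter are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Trajectory:
    task: str
    success: bool
    policy_version: int  # version of the policy that produced this rollout


@dataclass
class SkillLibrary:
    """Reusable behavioral skills distilled from failure trajectories."""
    skills: List[str] = field(default_factory=list)

    def synthesize_from_failures(self, trajs: List[Trajectory]) -> List[str]:
        # Stand-in for the LLM evolver: in the paper this step analyzes
        # failure trajectories and writes new skills; here we just tag tasks.
        new = [f"skill::{t.task}" for t in trajs if not t.success]
        self.skills.extend(new)
        return new


class OMLS:
    """Hypothetical Opportunistic Meta-Learning Scheduler: trigger policy
    optimization only during user-inactive windows."""

    def __init__(self, idle_threshold_s: int = 1800):
        self.idle_threshold_s = idle_threshold_s

    def should_optimize(self, seconds_idle: int, calendar_busy: bool) -> bool:
        # Fire only when the system has been idle long enough and the
        # user's calendar shows no upcoming activity.
        return seconds_idle >= self.idle_threshold_s and not calendar_busy


def split_support_query(
    trajs: List[Trajectory], current_version: int
) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Versioning against contamination: support data comes from older
    policy versions, query data only from the current version."""
    support = [t for t in trajs if t.policy_version < current_version]
    query = [t for t in trajs if t.policy_version == current_version]
    return support, query
```

The virtuous cycle described in the abstract would then correspond to alternating these calls: skills synthesized from failures improve rollouts immediately (zero downtime), while the scheduler later uses those richer trajectories, split by version, for gradient-based policy updates.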