Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses
June 6, 2026
Authors: Xiaojun Wu, Cehao Yang, Honghao Liu, Xueyuan Lin, Wenjie Zhang, Zhichao Shi, Xuhui Jiang, Chengjin Xu, Jia Li, Jian Guo
cs.AI
Abstract
LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection or by reusing observed successes and failures as if counts alone were reliable belief. We introduce Bayesian-Agent, a native and cross-harness framework that treats reusable skills and SOPs as hypotheses about whether a frozen model will succeed under a particular prompt, context, and harness environment. Bayesian-Agent records verified trajectory evidence, maintains a feature-conditioned categorical posterior over each skill, and maps posterior state into inspectable actions such as patch, split, compress, retire, and explore. Model-facing prompts receive executable guardrails and failure-mode patches, while posterior summaries remain available for audit. With deepseek-v4-flash, incremental repair improves SOP-Bench from 80% to 95%, Lifelong AgentBench from 90% to 100%, and RealFin-Bench from 45% to 65%. We further evaluate Bayesian-Agent's native backend and optional GenericAgent, mini-swe-agent, and Claude Code backends. The results include positive, negative, saturated, and case-study settings, suggesting that agent skill evolution is best viewed as posterior-guided harness optimization rather than uncalibrated prompt accumulation. The source code is available at https://github.com/DataArcTech/Bayesian-Agent.
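To make the posterior-to-action idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a symmetric Dirichlet prior over three outcome categories, a string key standing in for the feature context, and hand-picked thresholds. The names `SkillPosterior`, `choose_action`, `OUTCOMES`, and all numeric defaults are hypothetical and do not come from the Bayesian-Agent repository.

```python
from collections import defaultdict

# Hypothetical outcome categories for a verified trajectory.
OUTCOMES = ("success", "partial", "failure")


class SkillPosterior:
    """Dirichlet-categorical posterior over outcomes, kept per feature context."""

    def __init__(self, prior=1.0):
        # Symmetric prior pseudo-count per outcome (an assumption of this sketch).
        self.prior = prior
        self.alpha = defaultdict(lambda: {o: prior for o in OUTCOMES})

    def update(self, context, outcome):
        """Record one verified outcome observed under `context`."""
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self.alpha[context][outcome] += 1.0

    def mean(self, context):
        """Posterior mean probability of each outcome in `context`."""
        counts = self.alpha[context]
        total = sum(counts.values())
        return {o: c / total for o, c in counts.items()}

    def evidence(self, context):
        """Number of real observations in `context`, excluding the prior."""
        return sum(self.alpha[context].values()) - self.prior * len(OUTCOMES)


def choose_action(post, contexts, min_evidence=5, low=0.3, high=0.9, gap=0.4):
    """Map posterior state to one action name; all thresholds are illustrative."""
    # Too little evidence in some context: gather more before revising the skill.
    if any(post.evidence(c) < min_evidence for c in contexts):
        return "explore"
    success = [post.mean(c)["success"] for c in contexts]
    # The skill behaves very differently across contexts: specialise it.
    if max(success) - min(success) > gap:
        return "split"
    # Reliably failing everywhere: drop the skill.
    if max(success) <= low:
        return "retire"
    # Reliably succeeding everywhere: shorten the prompt text it occupies.
    if min(success) >= high:
        return "compress"
    # Mixed results: repair the observed failure modes.
    return "patch"


if __name__ == "__main__":
    post = SkillPosterior()
    for _ in range(8):
        post.update("json_tool", "success")
        post.update("shell_tool", "failure")
    print(choose_action(post, ["json_tool", "shell_tool"]))  # prints "split"
```

Unlike a raw success count, the posterior mean is pulled toward the prior when observations are few, so the sketch returns `explore` rather than acting on thin evidence. The decision order (explore, split, retire, compress, patch) is a design choice of this illustration only.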