Bayesian-Agent: Posterior-Guided Skill Evolution for Harnessing LLM Agents

Abstract

LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection, or by reusing observed successes and failures as if raw counts alone were a reliable belief. We introduce Bayesian-Agent, a native and cross-harness framework that treats reusable skills and SOPs as hypotheses about whether a frozen model will succeed under a particular prompt, context, and harness environment. Bayesian-Agent records verified trajectory evidence, maintains a feature-conditioned categorical posterior over each skill, and maps the posterior state to inspectable actions such as patch, split, compress, retire, and explore. Model-facing prompts receive executable guardrails and failure-mode patches, while posterior summaries remain available for audit. With deepseek-v4-flash, incremental repair improves accuracy on SOP-Bench from 80% to 95%, on Lifelong AgentBench from 90% to 100%, and on RealFin-Bench from 45% to 65%. We further evaluate Bayesian-Agent's native backend and its optional GenericAgent, mini-swe-agent, and Claude Code backends. The results cover positive, negative, saturated, and case-study settings, and suggest that agent skill evolution is best viewed as posterior-guided harness optimization rather than uncalibrated prompt accumulation. The source code is available at https://github.com/DataArcTech/Bayesian-Agent.
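To make the posterior-to-action idea concrete, the following is a minimal sketch and not the released implementation. It assumes a symmetric Dirichlet prior over three hypothetical outcome categories, keeps one count table per feature key, and maps the posterior mean to the five actions named above. The class `SkillPosterior`, the function `choose_action`, the outcome labels, and every threshold are illustrative assumptions; the abstract does not specify any of them.

```python
from collections import defaultdict

# Hypothetical outcome categories; the abstract does not list them.
OUTCOMES = ("success", "failure", "error")


class SkillPosterior:
    """Dirichlet-categorical posterior over outcomes, one table per feature key."""

    def __init__(self, prior=1.0):
        self.prior = prior  # symmetric Dirichlet pseudo-count (assumed)
        self.counts = defaultdict(lambda: dict.fromkeys(OUTCOMES, 0))

    def record(self, features, outcome):
        """Add one verified trajectory outcome under a feature key."""
        self.counts[features][outcome] += 1

    def n(self, features):
        """Number of recorded trajectories for a feature key."""
        return sum(self.counts[features].values())

    def mean(self, features):
        """Posterior mean probability of each outcome for a feature key."""
        c = self.counts[features]
        total = sum(c.values()) + self.prior * len(OUTCOMES)
        return {o: (c[o] + self.prior) / total for o in OUTCOMES}


def choose_action(post, min_evidence=5, low=0.4, high=0.9, gap=0.4):
    """Map posterior state to an action; all thresholds are illustrative."""
    keys = list(post.counts)
    total = sum(post.n(k) for k in keys)
    if total < min_evidence:
        return "explore"  # too little evidence to trust any estimate
    # Success rates for feature keys that have enough evidence on their own.
    rates = [post.mean(k)["success"] for k in keys if post.n(k) >= min_evidence]
    if len(rates) >= 2 and max(rates) - min(rates) > gap:
        return "split"  # the skill behaves differently across contexts
    # Pooled posterior mean of success across all feature keys.
    successes = sum(post.counts[k]["success"] for k in keys)
    pooled = (successes + post.prior) / (total + post.prior * len(OUTCOMES))
    if pooled < low:
        return "retire"
    if pooled >= high:
        return "compress"
    return "patch"


if __name__ == "__main__":
    skill = SkillPosterior()
    for _ in range(9):
        skill.record(("sop",), "success")
    skill.record(("sop",), "failure")
    skill.record(("finance",), "success")
    for _ in range(7):
        skill.record(("finance",), "failure")
    print(choose_action(skill))
```

Under this prior, two successes out of two trials give a posterior mean of 0.6 rather than 1.0, and the sketch returns `explore` instead of treating the skill as proven. This is one way to read the abstract's point that counts alone are not a reliable belief.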