Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses

Abstract

LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection, or by reusing observed successes and failures as if raw counts alone amounted to reliable belief. We introduce Bayesian-Agent, a native and cross-harness framework that treats reusable skills and SOPs as hypotheses about whether a frozen model will succeed under a particular prompt, context, and harness environment. Bayesian-Agent records verified trajectory evidence, maintains a feature-conditioned categorical posterior over each skill, and maps the posterior state to inspectable actions such as patch, split, compress, retire, and explore. Model-facing prompts receive executable guardrails and failure-mode patches, while posterior summaries remain available for audit. With deepseek-v4-flash, incremental repair improves SOP-Bench from 80% to 95%, Lifelong AgentBench from 90% to 100%, and RealFin-Bench from 45% to 65%. We further evaluate Bayesian-Agent's native backend and its optional GenericAgent, mini-swe-agent, and Claude Code backends. The results cover positive, negative, saturated, and case-study settings, and suggest that agent skill evolution is best viewed as posterior-guided harness optimization rather than uncalibrated prompt accumulation. The source code is available at https://github.com/DataArcTech/Bayesian-Agent.
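To make the idea of a feature-conditioned categorical posterior and its mapping to actions concrete, the following is a minimal sketch only, not the Bayesian-Agent implementation. It assumes a symmetric Dirichlet prior over three hypothetical outcome categories, and the names `SkillPosterior` and `choose_action`, the feature tuples, and all thresholds are illustrative assumptions that do not come from the abstract.

```python
from collections import defaultdict

# Hypothetical outcome categories; the abstract does not specify them.
OUTCOMES = ("success", "failure", "error")


class SkillPosterior:
    """Dirichlet-categorical posterior over outcomes, kept per feature context."""

    def __init__(self, prior=1.0):
        self.prior = prior  # symmetric Dirichlet pseudo-count (assumed)
        # counts[features][outcome] -> number of verified observations
        self.counts = defaultdict(lambda: dict.fromkeys(OUTCOMES, 0))

    def update(self, features, outcome):
        """Record one verified trajectory outcome for a feature context."""
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self.counts[features][outcome] += 1

    def evidence(self, features):
        """Number of verified observations seen in this context."""
        return sum(self.counts.get(features, {}).values())

    def mean(self, features):
        """Posterior mean probability of each outcome in this context."""
        seen = self.counts.get(features, dict.fromkeys(OUTCOMES, 0))
        total = sum(seen.values()) + self.prior * len(OUTCOMES)
        return {o: (seen[o] + self.prior) / total for o in OUTCOMES}


def choose_action(post, min_evidence=5, low=0.3, high=0.9, spread=0.4):
    """Map posterior state to an action name (all thresholds are assumptions)."""
    contexts = [f for f in post.counts if post.evidence(f) >= min_evidence]
    if not contexts:
        return "explore"  # too little verified evidence to act on
    rates = [post.mean(f)["success"] for f in contexts]
    if max(rates) - min(rates) >= spread:
        return "split"  # the skill behaves very differently across contexts
    avg = sum(rates) / len(rates)
    if avg >= high:
        return "compress"  # reliably works, so the guidance can be shortened
    if avg <= low:
        return "retire"  # reliably fails, so stop reusing it
    return "patch"  # mixed record, so add a failure-mode patch
```

In this sketch a skill with no verified evidence yields `explore`, while a skill that succeeds in one context and fails in another yields `split`. The posterior means returned by `mean` are the kind of summary that could be kept for audit, separate from whatever text is placed in the model-facing prompt.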