AgentDevel: Reframing Self-Evolving LLM Agents as Release Engineering
January 8, 2026
Author: Di Zhang
cs.AI
Abstract
Recent progress in large language model (LLM) agents has largely focused on embedding self-improvement mechanisms inside the agent or searching over many concurrent variants. While these approaches can raise aggregate scores, they often yield unstable and hard-to-audit improvement trajectories, making it difficult to guarantee non-regression or to reason about failures across versions. We reframe agent improvement as release engineering: agents are treated as shippable artifacts, and improvement is externalized into a regression-aware release pipeline. We introduce AgentDevel, a release engineering pipeline that iteratively runs the current agent, produces implementation-blind, symptom-level quality signals from execution traces, synthesizes a single release candidate (RC) via executable diagnosis, and promotes it under flip-centered gating. AgentDevel features three core designs: (i) an implementation-blind LLM critic that characterizes failure appearances without accessing agent internals, (ii) script-based executable diagnosis that aggregates dominant symptom patterns and produces auditable engineering specifications, and (iii) flip-centered gating that treats pass-to-fail regressions and fail-to-pass fixes as first-class evidence. Unlike population-based search or in-agent self-refinement, AgentDevel maintains a single canonical version line and treats non-regression as a primary objective. Experiments on execution-heavy benchmarks demonstrate that AgentDevel yields stable improvements with significantly fewer regressions while producing reproducible, auditable artifacts. Overall, AgentDevel provides a practical development discipline for building, debugging, and releasing LLM agents as software.
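As a concrete illustration of the flip-centered gating step, the sketch below compares per-task pass/fail outcomes between the current release and a candidate, and promotes only when the candidate fixes at least one task without exceeding a regression budget. This is a minimal sketch under assumed interfaces; the names (`FlipReport`, `flip_report`, `promote`, `max_regressions`) are illustrative and not an API from the paper.

```python
# Minimal sketch of flip-centered gating (illustrative, not the paper's API).
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class FlipReport:
    fixes: list[str]        # tasks that flipped fail -> pass under the candidate
    regressions: list[str]  # tasks that flipped pass -> fail under the candidate

def flip_report(baseline: dict[str, bool], candidate: dict[str, bool]) -> FlipReport:
    """Compare per-task outcomes (task id -> passed?) across two versions.

    A task missing from one side is treated as failing there, so new tasks the
    candidate solves count as fixes and dropped tasks count as regressions.
    """
    fixes = [t for t, ok in candidate.items() if ok and not baseline.get(t, False)]
    regressions = [t for t, ok in baseline.items() if ok and not candidate.get(t, False)]
    return FlipReport(fixes=fixes, regressions=regressions)

def promote(report: FlipReport, max_regressions: int = 0) -> bool:
    """Gate on flips rather than aggregate score: the candidate must fix
    something and must stay within the regression budget (default: zero)."""
    return len(report.fixes) > 0 and len(report.regressions) <= max_regressions

# Example: the candidate fixes t2 and regresses nothing, so it is promoted.
baseline = {"t1": True, "t2": False, "t3": True}
candidate = {"t1": True, "t2": True, "t3": True}
assert promote(flip_report(baseline, candidate))
```

Gating on flips rather than on the aggregate score means a candidate that trades one solved task for another is rejected even if its overall score is unchanged, which is the non-regression property the pipeline is built around.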