AgentDevel: Reframing Self-Evolving LLM Agents as Release Engineering

Abstract

Recent progress in large language model (LLM) agents has largely focused on embedding self-improvement mechanisms inside the agent or on searching over many concurrent variants. While these approaches can raise aggregate scores, they often yield unstable, hard-to-audit improvement trajectories, making it difficult to guarantee non-regression or to trace the causes of failures across versions. We reframe agent improvement as release engineering: agents are treated as shippable artifacts, and improvement is externalized into a regression-aware release pipeline. We introduce AgentDevel, a release engineering pipeline that iteratively runs the current agent, produces implementation-blind, symptom-level quality signals from execution traces, synthesizes a single release candidate (RC) via executable diagnosis, and promotes it under flip-centered gating. AgentDevel features three core designs: (i) an implementation-blind LLM critic that characterizes how failures appear without accessing agent internals, (ii) script-based executable diagnosis that aggregates dominant symptom patterns and produces auditable engineering specifications, and (iii) flip-centered gating that treats pass-to-fail regressions and fail-to-pass fixes as first-class evidence. Unlike population-based search or in-agent self-refinement, AgentDevel maintains a single canonical version line and treats non-regression as a primary objective. Experiments on execution-heavy benchmarks demonstrate that AgentDevel yields stable improvements with significantly fewer regressions while producing reproducible, auditable artifacts. Overall, AgentDevel provides a practical development discipline for building, debugging, and releasing LLM agents as software.
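
The flip-centered gating mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the actual promotion rule, so everything here is an assumption made for illustration: the names `FlipReport`, `count_flips`, and `should_promote`, the per-task pass/fail dictionaries, and the threshold `max_regressions` are all hypothetical and are not AgentDevel's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlipReport:
    """Per-task outcome changes between the current agent and a release candidate."""
    regressions: list[str]  # tasks that flipped pass -> fail
    fixes: list[str]        # tasks that flipped fail -> pass


def count_flips(baseline: dict[str, bool], candidate: dict[str, bool]) -> FlipReport:
    """Compare pass/fail results task by task, over tasks present in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    regressions = [t for t in shared if baseline[t] and not candidate[t]]
    fixes = [t for t in shared if not baseline[t] and candidate[t]]
    return FlipReport(regressions=regressions, fixes=fixes)


def should_promote(report: FlipReport, max_regressions: int = 0) -> bool:
    """Hypothetical gate: cap regressions, and require fixes to outnumber them."""
    within_budget = len(report.regressions) <= max_regressions
    net_positive = len(report.fixes) > len(report.regressions)
    return within_budget and net_positive


# Illustrative data: both versions pass 2 of 3 tasks, so the aggregate score
# is unchanged, yet the candidate breaks a task that used to pass.
baseline = {"t1": True, "t2": False, "t3": True}
candidate = {"t1": True, "t2": True, "t3": False}
report = count_flips(baseline, candidate)
promoted = should_promote(report)
```

In this made-up example the aggregate pass rate is identical for both versions, but the flip report exposes one regression (`t3`) alongside one fix (`t2`), so the assumed gate rejects the candidate. That is the kind of evidence an aggregate score alone would hide.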
