AgentDevel: Reframing Self-Evolving LLM Agents as Release Engineering

Abstract

Recent progress in large language model (LLM) agents has largely focused on embedding self-improvement mechanisms inside the agent or on searching over many concurrent variants. While these approaches can raise aggregate scores, they often yield unstable and hard-to-audit improvement trajectories, making it difficult to guarantee non-regression across versions or to analyze the causes of failures. We reframe agent improvement as release engineering: agents are treated as shippable artifacts, and improvement is externalized into a regression-aware release pipeline. We introduce AgentDevel, a release engineering pipeline that iteratively runs the current agent, produces implementation-blind, symptom-level quality signals from execution traces, synthesizes a single release candidate (RC) via executable diagnosis, and promotes it under flip-centered gating. AgentDevel features three core designs: (i) an implementation-blind LLM critic that characterizes how failures appear without accessing agent internals, (ii) script-based executable diagnosis that aggregates dominant symptom patterns and produces auditable engineering specifications, and (iii) flip-centered gating that treats pass-to-fail regressions and fail-to-pass fixes as first-class evidence. Unlike population-based search or in-agent self-refinement, AgentDevel maintains a single canonical version line and makes non-regression a primary objective. Experiments on execution-heavy benchmarks show that AgentDevel yields stable improvements with significantly fewer regressions while producing reproducible, auditable artifacts. Overall, AgentDevel provides a practical development discipline for building, debugging, and releasing LLM agents in the same way as conventional software.
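
As a rough illustration of the flip-centered gating idea, the sketch below compares per-example pass/fail outcomes of the current agent and a release candidate, and counts pass-to-fail and fail-to-pass flips. The abstract does not state the actual promotion rule, so the names (FlipReport, count_flips, should_promote), the max_regressions threshold, and the decision logic are all hypothetical assumptions made for this example, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlipReport:
    pass_to_fail: list[str]  # regressions: passed on the baseline, fail on the candidate
    fail_to_pass: list[str]  # fixes: failed on the baseline, pass on the candidate


def count_flips(baseline: dict[str, bool], candidate: dict[str, bool]) -> FlipReport:
    """Compare per-example outcomes of two agent versions on shared examples."""
    shared = sorted(baseline.keys() & candidate.keys())
    p2f = [k for k in shared if baseline[k] and not candidate[k]]
    f2p = [k for k in shared if not baseline[k] and candidate[k]]
    return FlipReport(pass_to_fail=p2f, fail_to_pass=f2p)


def should_promote(report: FlipReport, max_regressions: int = 0) -> bool:
    """Hypothetical gate (an assumption, not the paper's rule): promote only if
    regressions stay within budget and fixes outnumber regressions."""
    regressions = len(report.pass_to_fail)
    fixes = len(report.fail_to_pass)
    return regressions <= max_regressions and fixes > regressions


# Toy outcomes (made-up example IDs): True means the example passed.
baseline = {"t1": True, "t2": False, "t3": True, "t4": False}
candidate = {"t1": True, "t2": True, "t3": False, "t4": True}
report = count_flips(baseline, candidate)
strict_decision = should_promote(report)                      # zero-regression budget
lenient_decision = should_promote(report, max_regressions=1)  # one regression allowed
```

In this toy data the candidate's aggregate pass rate rises from 2/4 to 3/4, yet the zero-regression gate still rejects it because example t3 flipped from pass to fail. This is the kind of case where looking at flips rather than aggregate scores changes the release decision.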
