Your Agent Ages Too: Agent Lifespan Engineering for Deployed Systems

Abstract

Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering that measures not only whether deployed agents degrade, but also what form the degradation takes and where repair should be targeted. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Roughly 400 runs spanning 8 to 200 sessions, covering 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile indicates. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.