Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

Abstract

Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering that measures not only whether deployed agents degrade, but also what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, over ~400 runs spanning 8 to 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.