Your Agent Is Aging Too: Agent Lifespan Engineering in Deployed Systems

Abstract

Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated as if they were freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering that measures not only whether deployed agents degrade, but also what form the degradation takes and where repair should be targeted. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, approximately 400 runs, each spanning 8 to 200 sessions, show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.
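
The abstract does not specify how the paired counterfactual probes are constructed, so the following is only an illustrative sketch of how a stage-level diagnostic profile could be derived. It assumes each probe asks the same question under three conditions: the agent's own aged memory, a forced retrieval of the stored record, and the ground-truth fact placed directly in context. The names `ProbeResult`, `localize`, and `diagnostic_profile`, and the attribution rules themselves, are hypothetical and are not taken from AgingBench.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class Stage(Enum):
    """Memory-pipeline stages named in the abstract, plus a no-failure marker."""
    WRITE = "write"
    RETRIEVAL = "retrieval"
    UTILIZATION = "utilization"
    NONE = "none"


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one paired probe (hypothetical structure, assumed here)."""
    baseline_correct: bool          # answered from the agent's own aged memory
    oracle_retrieval_correct: bool  # stored record is force-retrieved
    oracle_context_correct: bool    # ground-truth fact is placed in context


def localize(p: ProbeResult) -> Stage:
    """Attribute a failure to the earliest stage whose counterfactual fixes it."""
    if p.baseline_correct:
        return Stage.NONE
    if not p.oracle_context_correct:
        # Wrong even with the true fact in context: the model misuses it.
        return Stage.UTILIZATION
    if not p.oracle_retrieval_correct:
        # The stored record itself does not help: it was written badly or lost.
        return Stage.WRITE
    # The stored record works when forced, so normal retrieval missed it.
    return Stage.RETRIEVAL


def diagnostic_profile(results: Iterable[ProbeResult]) -> Dict[str, float]:
    """Return the share of failures attributed to each pipeline stage."""
    counts = Counter(localize(r) for r in results)
    stages = (Stage.WRITE, Stage.RETRIEVAL, Stage.UTILIZATION)
    failures = sum(counts[s] for s in stages)
    return {s.value: (counts[s] / failures if failures else 0.0) for s in stages}


if __name__ == "__main__":
    probes = [
        ProbeResult(True, True, True),     # no failure
        ProbeResult(False, False, True),   # write-stage failure
        ProbeResult(False, True, True),    # retrieval-stage failure
        ProbeResult(False, True, True),    # retrieval-stage failure
        ProbeResult(False, False, False),  # utilization-stage failure
    ]
    print(diagnostic_profile(probes))
```

Under these assumptions, two agents that give the same wrong answer can receive different profiles, which is the sense in which the same error may call for different repairs: a write-dominated profile would point toward the compression or storage policy, while a retrieval-dominated one would point toward the index or ranking.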