Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study

June 2, 2026
Author: Sajjad Khan
cs.AI

Abstract

LLM-agent budget overruns are a documented production failure class: a single retry loop can spend thousands of dollars before an operator notices, and the in-process integrity properties that would prevent it (no aliasing, no double-spend, no use-after-delegation of a cost-bearing value) are enforced, if at all, by ad-hoc wrappers rather than by the type system. Our central contribution is empirical: a catalog of 63 confirmed production incidents from 21 orchestration frameworks (2023-2026), each backed by a quoted GitHub issue and, where reported, a dollar loss, organized into an eight-cluster failure taxonomy (inter-rater Cohen's kappa = 0.837, N = 113), plus 47 supplementary structural entries. As one mitigation evaluated against this taxonomy, we build token-budgets, an 1,180-line Rust crate (no unsafe) that operationalizes affine ownership so that cloning, double-spending, or using a budget after delegating it are compile errors rather than runtime hazards an operator must remember to avoid. The dollar cap is runtime arithmetic under an estimator assumption; the affine layer makes that arithmetic non-bypassable. On single-agent workloads a 4-line Python counter matches the crate at 0/30 overshoot, so the distinguishing value is non-bypassability under operator error in multi-agent delegation: the delegation-fanout race documented in 11 incidents is rejected by the borrow checker at compile time, while the same pattern under asyncio overshoots 30/30 and three disciplined alternatives overshoot 0/30. Across five runtimes, three providers, and a temperature-stratified live-API test (N = 160), the approach reports zero cap violations and zero false refusals, at operational parity with concurrent work. Static over-reservation is 4-6x (2.11x adaptive). Binary-level cap-soundness on the running binary is left open.
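The affine-ownership idea in the abstract can be illustrated with a minimal Rust sketch. This is not the token-budgets crate's API: the names `Budget`, `Exhausted`, `spend`, `delegate`, and `demo` are hypothetical, and the unit is assumed to be an abstract token count rather than dollars. The sketch shows the division of labor the abstract describes: the cap check is ordinary runtime arithmetic, while move semantics (a type that is neither `Clone` nor `Copy`, with methods that take `self` by value) make that check hard to bypass.

```rust
/// Hypothetical error type: the requested amount exceeds what is left.
#[derive(Debug, PartialEq)]
pub struct Exhausted {
    pub requested: u64,
    pub remaining: u64,
}

/// A budget in abstract token units (an assumption of this sketch).
/// Deliberately NOT `Clone` or `Copy`: a value can be moved, never duplicated.
#[derive(Debug)]
pub struct Budget {
    remaining: u64,
}

impl Budget {
    pub fn new(tokens: u64) -> Self {
        Budget { remaining: tokens }
    }

    pub fn remaining(&self) -> u64 {
        self.remaining
    }

    /// Takes `self` by value, so the pre-spend budget is moved and cannot be
    /// spent a second time. The cap check itself is plain runtime arithmetic.
    /// On failure the budget is consumed; the error reports what was left.
    pub fn spend(self, cost: u64) -> Result<Budget, Exhausted> {
        if cost > self.remaining {
            Err(Exhausted { requested: cost, remaining: self.remaining })
        } else {
            Ok(Budget { remaining: self.remaining - cost })
        }
    }

    /// Delegation: carves `amount` out for a child agent. Returns
    /// (parent remainder, child share); the original value is gone.
    pub fn delegate(self, amount: u64) -> Result<(Budget, Budget), Exhausted> {
        if amount > self.remaining {
            Err(Exhausted { requested: amount, remaining: self.remaining })
        } else {
            Ok((
                Budget { remaining: self.remaining - amount },
                Budget { remaining: amount },
            ))
        }
    }
}

/// Small usage example: delegate 40 of 100 tokens, then the child spends 25.
pub fn demo() -> (u64, u64) {
    let root = Budget::new(100);
    let (parent, child) = root.delegate(40).expect("40 <= 100");
    let child = child.spend(25).expect("25 <= 40");

    // Uncommenting either line below is rejected at compile time:
    // let _ = root.spend(10);   // error[E0382]: use of moved value: `root`
    // let _copy = parent.clone(); // no `clone` method: `Budget` is not `Clone`

    (parent.remaining(), child.remaining())
}
```

In this sketch, reusing `root` after delegation fails to compile, which mirrors the use-after-delegation case the abstract lists. A plain counter shared by reference would accept the same code and leave the mistake to be caught, if at all, at runtime.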