Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study

June 2, 2026
Author: Sajjad Khan
cs.AI

Abstract

LLM-agent budget overruns are a documented production failure class: a single retry loop can spend thousands of dollars before an operator notices, and the in-process integrity properties that would prevent it (no aliasing, no double-spend, no use-after-delegation of a cost-bearing value) are enforced, if at all, by ad-hoc wrappers rather than by the type system. Our central contribution is empirical: a catalog of 63 confirmed production incidents from 21 orchestration frameworks (2023-2026), each backed by a quoted GitHub issue and, where reported, a dollar loss, organized into an eight-cluster failure taxonomy (inter-rater Cohen's kappa = 0.837, N = 113), plus 47 supplementary structural entries. As one mitigation evaluated against this taxonomy, we build token-budgets, a 1,180-line Rust crate (no unsafe) that operationalizes affine ownership so that cloning, double-spending, or using a budget after delegating it are compile errors rather than runtime hazards an operator must remember to avoid. The dollar cap is runtime arithmetic under an estimator assumption; the affine layer makes that arithmetic non-bypassable. On single-agent workloads a 4-line Python counter matches the crate at 0/30 overshoot, so the distinguishing value is non-bypassability under operator error in multi-agent delegation: the delegation-fanout race documented in 11 incidents is rejected by the borrow checker at compile time, while the same pattern under asyncio overshoots 30/30 and three disciplined alternatives overshoot 0/30. Across five runtimes, three providers, and a temperature-stratified live-API test (N = 160), the approach reports zero cap violations and zero false refusals, at operational parity with concurrent work. Static over-reservation is 4-6x (2.11x adaptive). Binary-level cap-soundness on the running binary is left open.
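
To make the affine-ownership idea concrete, here is a minimal sketch of how a move-only budget value can turn cloning, double-spending, and use-after-delegation into compile errors. It is not the token-budgets crate's API, which the abstract does not show: the names `Budget`, `Refused`, `spend`, `delegate`, and `demo` are all hypothetical and chosen only for illustration. The cap check itself is ordinary runtime arithmetic; the type only ensures no second handle to the same budget exists.

```rust
use std::thread;

/// Hypothetical affine budget (an assumption, not the crate's type).
/// It derives neither `Clone` nor `Copy`, so each value has exactly one owner.
#[derive(Debug)]
pub struct Budget {
    remaining: u64, // tokens still available under this cap
}

/// Returned when a request does not fit; hands the untouched budget back.
#[derive(Debug)]
pub struct Refused(pub Budget);

impl Budget {
    pub fn new(cap: u64) -> Self {
        Budget { remaining: cap }
    }

    pub fn remaining(&self) -> u64 {
        self.remaining
    }

    /// Takes `self` by value: the pre-spend budget is moved away, so it
    /// cannot be spent a second time.
    pub fn spend(self, tokens: u64) -> Result<Budget, Refused> {
        match self.remaining.checked_sub(tokens) {
            Some(left) => Ok(Budget { remaining: left }),
            None => Err(Refused(self)),
        }
    }

    /// Splits the budget into (parent remainder, child share). The two parts
    /// always sum to the original, so fan-out cannot exceed the cap.
    pub fn delegate(self, share: u64) -> Result<(Budget, Budget), Refused> {
        match self.remaining.checked_sub(share) {
            Some(left) => Ok((Budget { remaining: left }, Budget { remaining: share })),
            None => Err(Refused(self)),
        }
    }
}

/// Delegates part of a budget to a worker thread and reports what is left.
pub fn demo() -> (u64, u64) {
    let root = Budget::new(1_000);
    let (parent, child) = root.delegate(400).expect("share fits under the cap");

    // root.spend(1);            // would not compile: `root` was moved by `delegate`
    // let copy = child.clone(); // would not compile: `Budget` does not implement `Clone`

    // The child share is moved into the worker; the parent keeps no handle to it.
    let worker = thread::spawn(move || child.spend(150).map(|b| b.remaining()));
    let child_left = worker
        .join()
        .expect("worker panicked")
        .expect("within the child's share");

    (parent.remaining(), child_left)
}
```

Calling `demo()` returns the parent's remainder and the child's remainder after the worker spends 150 of its 400-token share. Uncommenting either of the two marked lines should make the build fail, which is the property the abstract describes: the mistake is rejected before the program runs.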