Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, and a Case Study of an Affine-Typed Rust Mitigation

Abstract

LLM-agent budget overruns are a documented production failure class: a single retry loop can spend thousands of dollars before an operator notices, and the in-process integrity properties that would prevent it (no aliasing, no double-spend, no use-after-delegation of a cost-bearing value) are enforced, if at all, by ad-hoc wrappers rather than by the type system. Our central contribution is empirical: a catalog of 63 confirmed production incidents from 21 orchestration frameworks (2023-2026), each backed by a quoted GitHub issue and, where reported, a dollar loss, organized into an eight-cluster failure taxonomy (inter-rater Cohen's kappa = 0.837, N = 113), plus 47 supplementary structural entries. As one mitigation evaluated against this taxonomy, we build token-budgets, a 1,180-line Rust crate (no unsafe) that operationalizes affine ownership so that cloning a budget, double-spending it, or using it after delegating it is a compile error rather than a runtime hazard an operator must remember to avoid. The dollar cap is runtime arithmetic under an estimator assumption; the affine layer makes that arithmetic non-bypassable. On single-agent workloads a 4-line Python counter matches the crate at 0/30 overshoot, so the distinguishing value is non-bypassability under operator error in multi-agent delegation: the delegation-fanout race documented in 11 incidents is rejected by the borrow checker at compile time, while the same pattern under asyncio overshoots 30/30 and three disciplined alternatives overshoot 0/30. Across five runtimes, three providers, and a temperature-stratified live-API test (N = 160), the approach reports zero cap violations and zero false refusals, at operational parity with concurrent work. Static over-reservation is 4-6x (2.11x adaptive). Binary-level cap-soundness on the running binary is left open.
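To make the affine-ownership claim concrete, the following is a minimal sketch of how a budget handle might be typed so that aliasing, double-spending, and use-after-delegation fail to compile. It is not the token-budgets API: the names `Budget`, `spend`, `delegate`, and the micro-dollar unit are hypothetical, chosen only for illustration. The cap check itself is ordinary runtime arithmetic; the type only ensures that arithmetic cannot be sidestepped by holding a stale copy.

```rust
/// Hypothetical budget handle (illustrative; not the crate's actual type).
/// It deliberately does NOT implement `Clone` or `Copy`, so a value can be
/// moved but never duplicated.
#[derive(Debug)]
pub struct Budget {
    remaining_micro_usd: u64,
}

impl Budget {
    pub fn new(cap_micro_usd: u64) -> Self {
        Budget { remaining_micro_usd: cap_micro_usd }
    }

    pub fn remaining(&self) -> u64 {
        self.remaining_micro_usd
    }

    /// Consumes the handle and returns a reduced one. Taking `self` by value
    /// means the pre-spend handle is gone, so it cannot be spent twice.
    /// On an over-cap request the unchanged handle is returned as `Err`.
    pub fn spend(self, estimated_cost: u64) -> Result<Budget, Budget> {
        match self.remaining_micro_usd.checked_sub(estimated_cost) {
            Some(rest) => Ok(Budget { remaining_micro_usd: rest }),
            None => Err(self),
        }
    }

    /// Consumes the parent and returns `(parent_remainder, child)`.
    /// The amount given to the child is no longer reachable via the parent.
    pub fn delegate(self, child_cap: u64) -> Result<(Budget, Budget), Budget> {
        match self.remaining_micro_usd.checked_sub(child_cap) {
            Some(rest) => Ok((
                Budget { remaining_micro_usd: rest },
                Budget { remaining_micro_usd: child_cap },
            )),
            None => Err(self),
        }
    }
}

fn main() {
    let parent = Budget::new(1_000);
    let (parent, child) = parent.delegate(400).expect("cap allows delegation");
    let child = child.spend(250).expect("within child cap");

    // Over-cap spend is refused at runtime; the handle comes back unchanged.
    let child = child.spend(500).unwrap_err();

    println!("parent={} child={}", parent.remaining(), child.remaining());
    // prints "parent=600 child=150"

    // Each of the following is rejected by the compiler if uncommented:
    // let alias = parent.clone();       // E0599: no method named `clone`
    // let a = parent.spend(1);
    // let b = parent.spend(1);          // E0382: use of moved value `parent`
}
```

In this sketch the fan-out mistake (handing the same budget to several sub-agents) has no well-typed spelling: the first `delegate` or `spend` moves the handle, and any later use of the old binding is a use-after-move error. A real implementation would also need an estimator for `estimated_cost` and a reconciliation step for actual usage, both of which are outside what this sketch shows.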