Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study

Abstract

LLM-agent budget overruns are a documented class of production failure: a single retry loop can spend thousands of dollars before an operator notices. The in-process integrity properties that would prevent this (no aliasing, no double-spend, and no use-after-delegation of a cost-bearing value) are enforced, if at all, by ad-hoc wrappers rather than by the type system. Our central contribution is empirical: a catalog of 63 confirmed production incidents from 21 orchestration frameworks (2023-2026), each backed by a quoted GitHub issue and, where reported, a dollar loss, organized into an eight-cluster failure taxonomy (inter-rater Cohen's kappa = 0.837, N = 113), plus 47 supplementary structural entries. As one mitigation evaluated against this taxonomy, we build token-budgets, a 1,180-line Rust crate (no unsafe) that operationalizes affine ownership, so that cloning a budget, double-spending it, or using it after delegating it is a compile error rather than a runtime hazard an operator must remember to avoid. The dollar cap itself is runtime arithmetic under an estimator assumption; the affine layer makes that arithmetic non-bypassable. On single-agent workloads a 4-line Python counter matches the crate at 0/30 overshoot, so the distinguishing value is non-bypassability under operator error in multi-agent delegation: the delegation-fanout race documented in 11 incidents is rejected by the borrow checker at compile time, while the same pattern under asyncio overshoots in 30/30 runs and three disciplined alternatives overshoot in 0/30. Across five runtimes, three providers, and a temperature-stratified live-API test (N = 160), the approach reports zero cap violations and zero false refusals, at operational parity with concurrent work. Static over-reservation is 4-6x (2.11x with adaptive reservation). Cap-soundness at the level of the running binary is left open.
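To make the affine-ownership claim concrete, the following is a minimal sketch of how a non-cloneable budget value might turn double-spend and use-after-delegation into compile errors. It is not the token-budgets crate's API: the names `Budget`, `BudgetError`, `spend`, and `delegate`, and the choice of micro-dollar units, are assumptions made purely for illustration. The cap check is still ordinary runtime arithmetic; the type only ensures that no second handle to the same budget can exist.

```rust
/// Hypothetical affine budget (illustration only, not the crate's API).
/// It deliberately does NOT derive `Clone` or `Copy`.
#[derive(Debug)]
pub struct Budget {
    remaining_micro_usd: u64,
}

#[derive(Debug, PartialEq)]
pub enum BudgetError {
    Insufficient { requested: u64, remaining: u64 },
}

impl Budget {
    pub fn new(cap_micro_usd: u64) -> Self {
        Budget { remaining_micro_usd: cap_micro_usd }
    }

    pub fn remaining(&self) -> u64 {
        self.remaining_micro_usd
    }

    /// Consumes the budget and returns the reduced one.
    /// On refusal the unchanged budget is handed back with the error.
    pub fn spend(self, cost: u64) -> Result<Budget, (Budget, BudgetError)> {
        match self.remaining_micro_usd.checked_sub(cost) {
            Some(rest) => Ok(Budget { remaining_micro_usd: rest }),
            None => {
                let err = BudgetError::Insufficient {
                    requested: cost,
                    remaining: self.remaining_micro_usd,
                };
                Err((self, err))
            }
        }
    }

    /// Moves the parent in and returns `(reduced parent, child)`.
    /// The two parts always sum to the original amount.
    pub fn delegate(self, amount: u64) -> Result<(Budget, Budget), (Budget, BudgetError)> {
        match self.remaining_micro_usd.checked_sub(amount) {
            Some(rest) => Ok((
                Budget { remaining_micro_usd: rest },
                Budget { remaining_micro_usd: amount },
            )),
            None => {
                let err = BudgetError::Insufficient {
                    requested: amount,
                    remaining: self.remaining_micro_usd,
                };
                Err((self, err))
            }
        }
    }
}

fn main() {
    let root = Budget::new(1_000_000);
    let (parent, child) = root.delegate(400_000).expect("within cap");

    // Each of the following lines is rejected at compile time if uncommented:
    // let again = root.delegate(400_000); // error[E0382]: use of moved value: `root`
    // let copy = parent.clone();          // error[E0599]: no method named `clone`

    let child = child.spend(150_000).expect("within child cap");
    println!("parent={} child={}", parent.remaining(), child.remaining());
}
```

Because `spend` and `delegate` take `self` by value, the only way to keep using a budget is through the value they return, which is what rules out a second agent spending from a stale handle in this sketch.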