多智能體大型語言模型系統中並發異常的經驗證偵測與預防

摘要

多智能體大型語言模型系統透過記憶儲存、向量索引及工具註冊表共享狀態。我們將此類共享建模為在確定性生成語義（即持久執行引擎透過確定性重播所執行的機制）下的長期讀取-生成-寫入操作，並在TLA+中形式化四種並發異常：過時生成（stale-generation）、幻影工具（phantom-tool）、因果串聯（causal-cascade）及工具效應重排序（tool-effect reordering），這些是經典隔離異常的結構類比，每種異常均附有TLC反例。這些異常上的排除格（exclusion lattice）是平凡的；其貢獻在於機械驗證的可實現性及其中一條最大鏈的嚴格分離，即L₀ ⊊ ⋯ ⊊ L₄，據我們所知，這是此類運行時首個經機器檢查的一致性層級體系。一項包含274項Verus義務（零假設、零容許；信任基礎：兩條結構公理及一個互斥對應）的開發工作，證明了檢測器相對於規範的完備性與可靠性，以及每個運行時對其避免集合的滿足性。三個已部署的Rust運行時實現了L₀-L₁（悲觀鎖定、可序列化快照隔離、預設快照隔離），各運行時均已針對過時生成進行驗證，並細化至其狀態機；L₂-L₄經執行模式驗證，並配備無依賴預防對偶（A3、A6、A2：0/1000對比1000/1000），其中L₂在三個模型家族中進行了即時運行（A3在所有120次撤銷會話中均被預防）。我們重現了位元組跳動deer-flow中的一個隱匿遺失更新，將其修復形式化為一個經驗證的L₀至L₁細化，並在LangGraph的ToolNode上展示了對未修改輸出的工具效應重排序，該問題已透過L₃提交順序排序器消除。經驗證的檢測器、細化及可實現性產物是本文的貢獻；所述現象與格架構則屬於經典範疇。

English

Multi-agent LLM systems share state through memory stores, vector indices, and tool registries. We model such sharing as long-running read-generate-write operations under deterministic-generation semantics -- the regime durable-execution engines enforce by deterministic replay -- and formalize four concurrency anomalies in TLA+: stale-generation, phantom-tool, causal-cascade, and tool-effect reordering, structural analogues of classical isolation anomalies, each with a TLC counter-example. The exclusion lattice over these anomalies is trivial; the contribution is the mechanically verified realizability and strict separation of one maximal chain within it, L_0 subsetneq cdots subsetneq L_4, to our knowledge the first machine-checked consistency hierarchy for such runtimes. A development of 274 Verus obligations (zero assume, zero admit; trust base: two structural axioms and a mutex correspondence) proves the detectors sound and complete against the specifications and each runtime its avoidance set. Three deployed Rust runtimes realize L0-L1 (pessimistic locking, serializable snapshot isolation, default-SI), each verified against stale-generation and refined to its state machine; L2-L4 are exec-mode-verified with dependency-free prevention twins (A3, A6, A2: 0/1000 versus 1000/1000), and L2 is run live across three model families (A3 prevented in all 120 retracted sessions). We reproduce a silent lost update in ByteDance's deer-flow, formalizing its fix as a verified L_0 to L_1 refinement, and exhibit tool-effect reordering in LangGraph's ToolNode on unmodified output, removed by an L3 commit-order sequencer. The verified detector, refinements, and realizability artifacts are the contribution; the phenomena and lattice are classical.