PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

Abstract

Large language model (LLM) agents now execute long, tool-using tasks in which final-outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM judging is costly. We introduce PrefixGuard, a trace-to-monitor framework with an offline StepView induction step followed by supervised monitor training. StepView induces deterministic typed-step adapters from raw trace samples, and the monitor learns an event abstraction and a prefix-risk scorer from terminal outcomes. Across WebArena, τ²-Bench, SkillsBench, and TerminalBench, the strongest PrefixGuard monitors reach an area under the precision-recall curve (AUPRC) of 0.900, 0.710, 0.533, and 0.557, respectively. Using the strongest backend within each representation, they improve over raw-text controls by an average of +0.137 AUPRC. LLM judges remain substantially weaker under the same prefix-warning protocol. We also derive an observability ceiling on score-based AUPRC that separates monitor error from failures for which the observed prefix contains no evidence. For finite-state audit, post-hoc deterministic finite automaton (DFA) extraction remains compact on WebArena and τ²-Bench (29 and 20 states) but expands to 151 and 187 states on SkillsBench and TerminalBench. Finally, first-alert diagnostics show that strong ranking does not imply deployment utility: WebArena ranks well yet fails to support low-false-alarm alerts, whereas τ²-Bench and TerminalBench retain more actionable early alerts. Together, these results position PrefixGuard as a practical monitor-synthesis recipe with explicit diagnostics for when prefix warnings translate into actionable interventions.
