
PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

May 7, 2026
Authors: Xinmiao Huang, Jinwei Hu, Rajarshi Roy, Changshun Wu, Yi Dong, Xiaowei Huang
cs.AI

Abstract

Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM judging is costly. We introduce PrefixGuard, a trace-to-monitor framework with an offline StepView induction step followed by supervised monitor training. StepView induces deterministic typed-step adapters from raw trace samples, and the monitor learns an event abstraction and prefix-risk scorer from terminal outcomes. Across WebArena, τ²-Bench, SkillsBench, and TerminalBench, the strongest PrefixGuard monitors reach an area under the precision-recall curve (AUPRC) of 0.900, 0.710, 0.533, and 0.557, respectively. Using the strongest backend within each representation, they improve over raw-text controls by an average of +0.137 AUPRC. LLM judges remain substantially weaker under the same prefix-warning protocol. We also derive an observability ceiling on score-based AUPRC that separates monitor error from failures lacking evidence in the observed prefix. For finite-state audit, post-hoc deterministic finite automaton (DFA) extraction remains compact on WebArena and τ²-Bench (29 and 20 states) but expands to 151 and 187 states on SkillsBench and TerminalBench. Finally, first-alert diagnostics show that strong ranking does not imply deployment utility: WebArena ranks well yet fails to support low-false-alarm alerts, whereas τ²-Bench and TerminalBench retain more actionable early alerts. Together, these results position PrefixGuard as a practical monitor-synthesis recipe with explicit diagnostics for when prefix warnings translate into actionable interventions.
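To make the distinction between ranking quality and alert utility concrete, here is a minimal Python sketch. It is not the paper's protocol. It assumes each trace comes with one risk score per prefix step and a terminal failure label, takes the maximum prefix score as the trace-level score, and estimates AUPRC as average precision. The names `average_precision`, `first_alert`, and `alert_summary`, the toy traces, and the 0.8 threshold are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Estimate AUPRC as average precision: the mean of precision@rank
    taken at each positive item, ranking by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def first_alert(prefix_scores: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first step whose score reaches the threshold,
    or None if the monitor never alerts on this trace."""
    for step, score in enumerate(prefix_scores):
        if score >= threshold:
            return step
    return None


def alert_summary(
    traces: List[Tuple[Sequence[float], bool]], threshold: float
) -> Tuple[float, float, Optional[float]]:
    """Return (false_alarm_rate, failure_recall, mean_lead_steps).
    Lead is the number of steps remaining after the first alert
    on failed traces (an assumed definition, for illustration only)."""
    false_alarms = successes = caught = failures = 0
    leads: List[int] = []
    for prefix_scores, failed in traces:
        step = first_alert(prefix_scores, threshold)
        if failed:
            failures += 1
            if step is not None:
                caught += 1
                leads.append(len(prefix_scores) - 1 - step)
        else:
            successes += 1
            if step is not None:
                false_alarms += 1
    far = false_alarms / successes if successes else 0.0
    recall = caught / failures if failures else 0.0
    mean_lead = sum(leads) / len(leads) if leads else None
    return far, recall, mean_lead


# Toy traces: (per-step risk scores, did the task ultimately fail?)
toy_traces = [
    ([0.1, 0.2, 0.9, 0.95], True),   # failure, alert fires one step before the end
    ([0.1, 0.1, 0.2], False),        # success, no alert
    ([0.3, 0.85, 0.4], False),       # success, but a transient spike raises a false alarm
    ([0.2, 0.3, 0.5, 0.7], True),    # failure that never crosses the threshold
]

# Assumption: trace-level score is the maximum prefix score.
trace_scores = [max(scores) for scores, _ in toy_traces]
trace_labels = [int(failed) for _, failed in toy_traces]

auprc = average_precision(trace_scores, trace_labels)
far, recall, mean_lead = alert_summary(toy_traces, threshold=0.8)
print(f"AUPRC={auprc:.3f} false_alarm_rate={far:.2f} recall={recall:.2f} lead={mean_lead}")
```

On this toy data the ranking metric looks fairly strong while the fixed-threshold alert policy still raises a false alarm on half of the successful traces. This is the kind of gap the first-alert diagnostics are meant to expose.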