PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

Abstract

Large language model (LLM) agents now execute long, tool-using tasks in which final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM judging is costly. We introduce PrefixGuard, a trace-to-monitor framework with an offline StepView induction step followed by supervised monitor training. StepView induces deterministic typed-step adapters from raw trace samples, and the monitor learns an event abstraction and prefix-risk scorer from terminal outcomes. Across WebArena, τ²-Bench, SkillsBench, and TerminalBench, the strongest PrefixGuard monitors reach an area under the precision-recall curve (AUPRC) of 0.900, 0.710, 0.533, and 0.557, respectively. Using the strongest backend within each representation, they improve over raw-text controls by an average of +0.137 AUPRC. LLM judges remain substantially weaker under the same prefix-warning protocol. We also derive an observability ceiling on score-based AUPRC that separates monitor error from failures lacking evidence in the observed prefix. For finite-state audit, post-hoc deterministic finite automaton (DFA) extraction remains compact on WebArena and τ²-Bench (29 and 20 states, respectively) but expands to 151 and 187 states on SkillsBench and TerminalBench. Finally, first-alert diagnostics show that strong ranking does not imply deployment utility: WebArena ranks well yet fails to support low-false-alarm alerts, whereas τ²-Bench and TerminalBench retain more actionable early alerts. Together, these results position PrefixGuard as a practical monitor-synthesis recipe with explicit diagnostics for when prefix warnings translate into actionable interventions.
