PrefixGuard: From LLM Agent Traces to Online Failure-Warning Monitors

Abstract

Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM judging is costly. We introduce PrefixGuard, a trace-to-monitor framework with an offline StepView induction step followed by supervised monitor training. StepView induces deterministic typed-step adapters from raw trace samples, and the monitor learns an event abstraction and prefix-risk scorer from terminal outcomes. Across WebArena, τ^2-Bench, SkillsBench, and TerminalBench, the strongest PrefixGuard monitors reach an area under the precision-recall curve (AUPRC) of 0.900/0.710/0.533/0.557, respectively. Using the strongest backend within each representation, they improve over raw-text controls by an average of +0.137 AUPRC. LLM judges remain substantially weaker under the same prefix-warning protocol. We also derive an observability ceiling on score-based AUPRC that separates monitor error from failures lacking evidence in the observed prefix. For finite-state audit, post-hoc deterministic finite automaton (DFA) extraction remains compact on WebArena and τ^2-Bench (29 and 20 states) but expands to 151 and 187 states on SkillsBench and TerminalBench. Finally, first-alert diagnostics show that strong ranking does not imply deployment utility: WebArena ranks well yet fails to support low-false-alarm alerts, whereas τ^2-Bench and TerminalBench retain more actionable early alerts. Together, these results position PrefixGuard as a practical monitor-synthesis recipe with explicit diagnostics for when prefix warnings translate into actionable interventions.
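
The gap between ranking quality and alert utility noted above can be illustrated with a minimal sketch. This is not PrefixGuard's evaluation code: the function names, the toy scores, the 0.6 threshold, and the choice to rank traces by the score at their last prefix are all assumptions made for illustration only.

```python
from typing import Optional, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Step-wise AUPRC: mean precision at each failing trace, in descending score order.

    Ties are broken by input order (no tie correction); fine for a toy example.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one failing trace")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def first_alert(prefix_scores: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first prefix whose risk score reaches the threshold, else None."""
    for step, score in enumerate(prefix_scores):
        if score >= threshold:
            return step
    return None


def alert_report(
    traces: Sequence[Sequence[float]], labels: Sequence[int], threshold: float
) -> Tuple[float, float, float]:
    """Return (failure recall, false-alarm rate, mean first-alert step on caught failures)."""
    alerts = [first_alert(trace, threshold) for trace in traces]
    caught = [a for a, y in zip(alerts, labels) if y == 1 and a is not None]
    false_alarms = sum(1 for a, y in zip(alerts, labels) if y == 0 and a is not None)
    n_fail = sum(labels)
    n_ok = len(labels) - n_fail
    recall = len(caught) / n_fail if n_fail else 0.0
    false_alarm_rate = false_alarms / n_ok if n_ok else 0.0
    mean_step = sum(caught) / len(caught) if caught else float("nan")
    return recall, false_alarm_rate, mean_step


# Hypothetical per-prefix risk scores for four traces (1 = task ultimately failed).
traces = [
    [0.2, 0.6, 0.9],  # failure, risk rises steadily
    [0.3, 0.5, 0.8],  # failure, risk rises later
    [0.7, 0.4, 0.1],  # success with an early risk spike
    [0.1, 0.2, 0.3],  # success, low risk throughout
]
labels = [1, 1, 0, 0]

# Assumption: rank traces by the score at the last observed prefix.
ap = average_precision([trace[-1] for trace in traces], labels)
report = alert_report(traces, labels, threshold=0.6)
print(ap, report)
```

In this toy data the final-prefix scores rank both failures above both successes, yet an online alert at the same threshold fires on one of the two successful traces because of its early spike. A monitor can therefore rank well while still producing false alarms once alerts are issued prefix by prefix.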