SafePred: A Predictive Guardrail for Computer-Using Agents via World Models

Abstract

With the widespread deployment of Computer-using Agents (CUAs) in complex real-world environments, prevalent long-term risks often lead to severe and irreversible consequences. Most existing guardrails for CUAs are reactive: they constrain agent behavior only within the current observation space. These guardrails can prevent immediate short-term risks (e.g., clicking on a phishing link), but they cannot proactively avoid long-term risks. A seemingly reasonable action can lead to high-risk consequences that emerge only after a delay (e.g., clearing logs makes future audits untraceable), and a reactive guardrail cannot identify such consequences within the current observation space. To address these limitations, we propose a predictive guardrail approach whose core idea is to align predicted future risks with current decisions. Based on this approach, we present SafePred, a predictive guardrail framework for CUAs that establishes a risk-to-decision loop to ensure safe agent behavior. SafePred supports two key abilities: (1) Short- and long-term risk prediction: using safety policies as the basis for risk prediction, SafePred leverages the predictive capability of the world model to generate semantic representations of both short-term and long-term risks, thereby identifying and pruning actions that lead to high-risk states; (2) Decision optimization: translating predicted risks into actionable, safe decision guidance through step-level interventions and task-level re-planning. Extensive experiments show that SafePred significantly reduces high-risk behaviors, achieving over 97.6% safety performance and improving task utility by up to 21.4% compared with reactive baselines.
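As a rough illustration of the risk-to-decision loop described in the abstract, the sketch below shows one way a predictive guardrail might prune risky candidate actions and fall back to re-planning. It is not SafePred's implementation: the abstract does not specify any interfaces, so every name here (`Prediction`, `Decision`, `guard_step`, `toy_world_model`), the numeric risk scores, and the `threshold` parameter are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# NOTE: all names and the scoring scheme are hypothetical; they only
# illustrate the idea of aligning predicted future risk with the current choice.


@dataclass
class Prediction:
    short_term_risk: float  # assumed risk of the immediate next state, in [0, 1]
    long_term_risk: float   # assumed risk of delayed consequences, in [0, 1]
    explanation: str        # semantic description of the predicted risk


# In this sketch a "world model" is any callable (state, action) -> Prediction.
WorldModel = Callable[[str, str], Prediction]


@dataclass
class Decision:
    action: Optional[str]  # chosen action, or None if every candidate was pruned
    guidance: List[str]    # risk explanations to feed back to the agent
    replan: bool           # True when task-level re-planning is requested


def guard_step(state: str, candidates: List[str], world_model: WorldModel,
               threshold: float = 0.5) -> Decision:
    """Step-level intervention: prune candidates whose predicted risk is high."""
    guidance: List[str] = []
    safe = []
    for action in candidates:
        pred = world_model(state, action)
        # Consider both horizons; a delayed risk counts as much as an immediate one.
        risk = max(pred.short_term_risk, pred.long_term_risk)
        if risk >= threshold:
            guidance.append(f"{action}: {pred.explanation}")
        else:
            safe.append((risk, action))
    if not safe:
        # No acceptable action at this step: escalate to task-level re-planning.
        return Decision(action=None, guidance=guidance, replan=True)
    best_action = min(safe, key=lambda pair: pair[0])[1]
    return Decision(action=best_action, guidance=guidance, replan=False)


def toy_world_model(state: str, action: str) -> Prediction:
    """Stand-in predictor mirroring the abstract's log-clearing example."""
    if "delete logs" in action:
        return Prediction(0.1, 0.9, "future audits become untraceable")
    return Prediction(0.1, 0.1, "no policy violation predicted")


if __name__ == "__main__":
    decision = guard_step("disk almost full",
                          ["delete logs", "compress old archives"],
                          toy_world_model)
    print(decision)
```

In this toy run, the log-deleting action looks harmless in the short term but is pruned because of its predicted long-term risk, and the explanation is returned as guidance. A real system would presumably replace `toy_world_model` with a learned or LLM-based predictor conditioned on safety policies.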
