
Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

March 3, 2026
Authors: Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah
cs.AI

Abstract

Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
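The plan, check, then act-or-refuse loop described above can be sketched as a minimal control flow. This is an illustrative reconstruction, not the paper's implementation: the function names (`plan`, `safety_check`, `execute_tool`), the `"refuse"`/`"allow"` verdicts, the `"done"` sentinel, and the step budget are all assumed for the sketch.

```python
# Minimal sketch of a plan-check-act/refuse agent loop, per the abstract's
# description. All names here (plan, safety_check, execute_tool, MAX_STEPS,
# the "refuse"/"done" conventions) are illustrative assumptions, not the
# paper's actual API.

MAX_STEPS = 10  # assumed horizon cap for the multi-step episode

def run_agent(task, plan, safety_check, execute_tool):
    """Run one episode: propose a step, check it, then act or refuse."""
    trajectory = []
    for _ in range(MAX_STEPS):
        step = plan(task, trajectory)        # propose the next tool call
        verdict = safety_check(task, step)   # explicit safety reasoning
        if verdict == "refuse":
            trajectory.append(("refuse", step))
            return trajectory                # refusal is a first-class action
        observation = execute_tool(step)     # act only after the check passes
        trajectory.append((step, observation))
        if step == "done":                   # assumed termination sentinel
            break
    return trajectory
```

The key design point from the abstract is that refusal is an explicit action inside the loop, not a post-hoc filter on the final answer: the safety check runs before every tool call, so a mid-trajectory injection can still trigger a refusal.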
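The abstract's "preference-based reinforcement learning with pairwise trajectory comparisons" replaces a scalar reward with a relative judgment between two trajectories. A common way to turn such a comparison into a training signal is a Bradley-Terry-style logistic loss; the sketch below shows that objective, assuming hypothetical scalar scores `score_preferred` and `score_rejected` that some model assigns to the safer and less-safe trajectory (the paper's exact objective may differ).

```python
import math

# Sketch of a pairwise (Bradley-Terry) preference objective over two
# trajectories: the loss is the negative log-probability that the
# preferred (safer) trajectory is ranked above the rejected one.
# The scores are illustrative; how they are produced is not specified here.

def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """-log sigmoid(score_preferred - score_rejected)."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Because the loss depends only on the score difference, it can separate a safe trajectory from an unsafe one even when both would receive similar scalar rewards for task progress, which is the failure mode the abstract attributes to scalar-reward training.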