

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

March 3, 2026
Authors: Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah
cs.AI

Abstract

Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
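The plan, check, then act or refuse loop described in the abstract can be sketched as follows. This is a minimal, self-contained illustration under assumed semantics: the planner, the safety checker, and the tool registry (`plan_step`, `safety_check`, `TOOLS`) are toy stand-ins, not the authors' MOSAIC implementation.

```python
# Toy sketch of a plan-check-act-or-refuse agent loop.
# All components below are hypothetical placeholders for illustration.

def plan_step(task, history):
    """Toy planner: propose a single tool call, then stop."""
    return None if history else {"tool": "read_file", "arg": task}

def safety_check(task, action, history):
    """Toy checker: treat credential access as an irreversible, unsafe step."""
    return "unsafe" if "credential" in action["arg"] else "safe"

# Toy tool registry.
TOOLS = {"read_file": lambda arg: f"contents of {arg}"}

def run_agent(task, max_steps=10):
    """Interleave planning, an explicit safety check, and tool execution."""
    history = []
    for _ in range(max_steps):
        action = plan_step(task, history)
        if action is None:                      # planner has nothing left to do
            break
        if safety_check(task, action, history) == "unsafe":
            history.append(("refuse", action))  # refusal is a first-class action
            return {"status": "refused", "history": history}
        obs = TOOLS[action["tool"]](action["arg"])  # act only after the check passes
        history.append((action, obs))
    return {"status": "completed", "history": history}
```

The key structural point is that the safety check sits between planning and execution, so a refusal is emitted as an explicit action in the trajectory rather than as a post-hoc filter on the final answer.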
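The preference-based training signal mentioned in the abstract compares whole trajectories pairwise rather than scoring each one with a scalar reward. A common way to turn such comparisons into a differentiable objective is a Bradley–Terry-style loss; the sketch below shows that standard form, which is an assumption here, not the paper's exact objective.

```python
import math

def pairwise_preference_loss(score_preferred, score_rejected):
    """Bradley-Terry negative log-likelihood that the preferred (safer)
    trajectory outscores the rejected one: -log sigmoid(s_w - s_l)."""
    margin = score_preferred - score_rejected
    # Equal scores give loss log(2); the loss shrinks as the margin grows.
    return math.log(1.0 + math.exp(-margin))
```

Because the loss depends only on the score margin between the two trajectories, it can separate a safe refusal from a superficially similar unsafe completion even when a scalar reward would assign both nearly the same value.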