Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
October 20, 2025
Authors: Jiawei Zhang, Andrew Estornell, David D. Baek, Bo Li, Xiaojun Xu
cs.AI
Abstract
Large Language Models (LLMs) exhibit strong but shallow alignment: they
directly refuse harmful queries when a refusal is expected at the very start of
an assistant turn, yet this protection collapses once a harmful continuation is
underway (either through adversarial attacks or via harmful
assistant-prefill attacks). This raises a fundamental question: Can the innate
shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation
depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an
effective inference-time defense with negligible overhead. ADA builds on our
observation that, through repeated use in shallow-refusal training, alignment
becomes concentrated in the assistant header tokens, which carry the model's
strong alignment priors. By reintroducing these tokens
mid-stream, ADA induces the model to reassess harmfulness and recover refusals
at any point in generation. Across diverse open-source model families (Llama,
Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety
performance without requiring any changes to the base model's parameters. It
secures a near-100% refusal rate against challenging adversarial prefill
attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces
the average success rate of prominent adversarial prompt attacks (such as GCG,
AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving
utility on benign tasks with minimal over-refusal. ADA maintains this
resilience even after the base model undergoes subsequent instruction tuning
(benign or adversarial).
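
As a rough illustration of the mechanism described above (not the paper's actual implementation), the sketch below re-inserts a fresh assistant header after a partial continuation and checks whether the model, prompted by those header tokens, now opens with a refusal. The model id, probe length, and string-matching refusal check are all illustrative assumptions.

# Minimal sketch, assuming a Hugging Face chat model whose template exposes an
# assistant header; every concrete choice below is illustrative, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: any chat model works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Crude refusal check for illustration; the abstract does not specify the actual scoring.
REFUSAL_MARKERS = ("I can't", "I cannot", "I'm sorry", "I won't")

def ada_style_check(user_prompt: str, partial_response: str) -> bool:
    """Re-open an assistant turn behind the partial continuation and see whether the
    model, cued by the reintroduced assistant header tokens, begins with a refusal."""
    messages = [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": partial_response},
    ]
    # add_generation_prompt=True appends a fresh assistant header after the partial
    # response -- the tokens the abstract says carry the model's alignment prior.
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(input_ids, max_new_tokens=16, do_sample=False)
    probe = tokenizer.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True)
    return probe.strip().startswith(REFUSAL_MARKERS)

Per the abstract, the reassessment happens mid-stream during generation with negligible overhead; the sketch only shows a single probe step on an already-produced partial response.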