Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
October 20, 2025
Authors: Jiawei Zhang, Andrew Estornell, David D. Baek, Bo Li, Xiaojun Xu
cs.AI
Abstract
Large Language Models (LLMs) exhibit strong but shallow alignment: they
directly refuse harmful queries when a refusal is expected at the very start of
an assistant turn, yet this protection collapses once a harmful continuation is
underway (whether through adversarial prompt attacks or harmful
assistant-prefill attacks). This raises a fundamental question: Can the innate
shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation
depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an
effective inference-time defense with negligible overhead. ADA builds on our
observation that, through repeated use in shallow-refusal training, alignment
becomes concentrated in the assistant header tokens, which carry the model's
strong alignment priors. By reintroducing these tokens
mid-stream, ADA induces the model to reassess harmfulness and recover refusals
at any point in generation. Across diverse open-source model families (Llama,
Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety
performance without requiring any changes to the base model's parameters. It
secures a near-100% refusal rate against challenging adversarial prefill
attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces
the average success rate of prominent adversarial prompt attacks (such as GCG,
AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving
utility on benign tasks with minimal over-refusal. ADA maintains this
resilience even after the base model undergoes subsequent instruction tuning
(benign or adversarial).
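
To make the mechanism described above concrete, the following is a minimal sketch of how mid-stream reintroduction of assistant header tokens could be wired up with Hugging Face transformers. The model name, header string, chunk size (probe_every), refusal markers, and the function ada_generate are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Sketch of the idea in the abstract: generate in chunks, and after each chunk
# re-insert the assistant header tokens so the model can reassess harmfulness
# and recover a refusal mid-generation. All names below are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # assumed chat model with an assistant header

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

# Tokens that open an assistant turn; per the abstract, these carry the model's
# strong alignment prior. The exact string is model-family specific (Llama 3 here).
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"
REFUSAL_MARKERS = ("I can't", "I cannot", "I'm sorry", "I won't")

def ada_generate(prompt: str, max_new_tokens: int = 512, probe_every: int = 64) -> str:
    """Generate in chunks; after each chunk, re-introduce the assistant header
    and probe whether the model now prefers to refuse the continuation."""
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    generated = ""
    for _ in range(0, max_new_tokens, probe_every):
        inputs = tokenizer(text + generated, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=probe_every, do_sample=False)
        new_ids = out[0, inputs.input_ids.shape[1]:]
        generated += tokenizer.decode(new_ids, skip_special_tokens=True)

        # Probe step: append the assistant header again and let the re-primed
        # model produce a short continuation.
        probe = tokenizer(text + generated + ASSISTANT_HEADER, return_tensors="pt").to(model.device)
        probe_out = model.generate(**probe, max_new_tokens=16, do_sample=False)
        probe_text = tokenizer.decode(
            probe_out[0, probe.input_ids.shape[1]:], skip_special_tokens=True
        )

        # If the re-primed model starts refusing, surface the refusal instead
        # of continuing the (potentially harmful) response.
        if probe_text.strip().startswith(REFUSAL_MARKERS):
            return probe_text
        if tokenizer.eos_token_id in new_ids.tolist():
            break
    return generated
```

This sketch uses a simple string-matching refusal check for clarity; the point it illustrates is only the control flow of re-injecting the header tokens mid-stream, which is the behavior the abstract attributes to ADA.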