
TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

August 4, 2025
Authors: Amitava Das, Vinija Jain, Aman Chadha
cs.AI

Abstract

Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift, producing unsafe or policy-violating completions when exposed to adversarial prompts, decoding perturbations, or paraphrased jailbreaks. While prior work has characterized alignment failure behaviorally, little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model's training corpus. Central to our approach is the Belief Conflict Index (BCI), which quantifies the semantic inconsistency between a generated span and the aligned policy, based on training documents retrieved via suffix-array matching. We propose three complementary interventions: (i) TraceShield, an inference-time safety filter that refuses completions containing high-BCI spans; (ii) Contrastive Belief Deconfliction Loss, a contrastive fine-tuning objective that penalizes high-BCI continuations during DPO; and (iii) Prov-Decode, a provenance-aware decoding strategy that vetoes beam expansions predicted to yield high-BCI spans. Together, these defenses reduce alignment drift by up to 85% on our curated Alignment Drift Benchmark (ADB) while preserving utility on standard tasks (a delta of less than 0.2) and improving refusal quality. We further derive a theoretical upper bound on drift likelihood from suffix-array span statistics, linking memorization frequency and span length to adversarial reactivation risk. TraceAlign thus provides the first scalable, traceable, and grounded toolkit for understanding and mitigating alignment failures at their source. To encourage further exploration and development, we open-source our implementation at: https://anonymous.4open.science/r/tracealign-2DA7
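The repository linked above contains the authors' implementation; as a rough illustration of the filtering idea the abstract describes, the following minimal Python sketch stands in exact n-gram lookup for the paper's suffix-array retrieval and reduces the BCI to a placeholder lexical score. Every name, threshold, and scoring detail here (`matched_spans`, `belief_conflict_index`, `traceshield_filter`, the 0.5 cutoff) is an illustrative assumption, not the paper's definition.

```python
# Minimal sketch of a TraceShield-style inference-time filter. TraceAlign
# retrieves matching training documents with a suffix array; here, exact
# n-gram lookup into a toy corpus index stands in for that retrieval, and
# the BCI is reduced to a placeholder lexical score. All names, thresholds,
# and scoring details are illustrative assumptions.

from typing import List, Set, Tuple

def matched_spans(tokens: List[str], corpus_index: Set[tuple], n: int = 4) -> List[Tuple[int, int]]:
    """Return (start, end) spans whose n-grams appear verbatim in the training corpus."""
    return [(i, i + n) for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) in corpus_index]

def belief_conflict_index(span_tokens: List[str], policy_terms: Set[str]) -> float:
    """Placeholder BCI: fraction of span tokens conflicting with the aligned policy.
    The paper's BCI instead scores semantic inconsistency against retrieved documents."""
    if not span_tokens:
        return 0.0
    conflicts = sum(1 for t in span_tokens if t.lower() in policy_terms)
    return conflicts / len(span_tokens)

def traceshield_filter(tokens: List[str], corpus_index: Set[tuple],
                       policy_terms: Set[str], threshold: float = 0.5) -> bool:
    """Refuse the completion if any memorized span exceeds the BCI threshold."""
    for start, end in matched_spans(tokens, corpus_index):
        if belief_conflict_index(tokens[start:end], policy_terms) >= threshold:
            return False  # reject: high-BCI memorized span detected
    return True  # accept

# Toy usage: index 4-grams of a "training corpus", then screen a completion.
corpus = "how to synthesize the compound safely in a lab".split()
index = {tuple(corpus[i:i + 4]) for i in range(len(corpus) - 3)}
completion = "first how to synthesize the compound at home".split()
print(traceshield_filter(completion, index, policy_terms={"synthesize", "compound"}))
```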
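In the same spirit, the decoding-time veto that Prov-Decode performs can be sketched as a pre-expansion check that reuses `traceshield_filter` from the snippet above to discard candidate beam extensions whose completed spans would exceed the BCI threshold. The interface below is assumed for illustration, not taken from the paper.

```python
# Schematic Prov-Decode-style veto under the same toy BCI. `candidates` pairs
# each beam (a token list) with a proposed next token; expansions whose
# extended text trips the filter are dropped before the beam search commits
# to them. This interface is an assumption for illustration.

def prov_decode_veto(candidates, corpus_index, policy_terms, threshold=0.5):
    """Keep only (beam, next_token) expansions whose spans stay below the BCI threshold."""
    kept = []
    for beam, next_token in candidates:
        if traceshield_filter(beam + [next_token], corpus_index, policy_terms, threshold):
            kept.append((beam, next_token))
    return kept
```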