TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs
August 4, 2025
Authors: Amitava Das, Vinija Jain, Aman Chadha
cs.AI
Abstract
Large Language Models (LLMs) fine-tuned to align with human values often
exhibit alignment drift, producing unsafe or policy-violating completions when
exposed to adversarial prompts, decoding perturbations, or paraphrased
jailbreaks. While prior work has behaviorally characterized alignment failure,
little is known about the training-time belief sources underlying these
failures. We introduce TraceAlign, a unified framework for tracing unsafe
completions back to their root causes in the model's training corpus. Central
to our approach is the Belief Conflict Index (BCI), which quantifies the semantic
inconsistency between generated spans and the aligned policy, based on training
documents retrieved via suffix-array matching. We propose three complementary
interventions: (i) TraceShield, an inference-time safety filter that refuses
completions with high-BCI spans, (ii) Contrastive Belief Deconfliction Loss, a
contrastive fine-tuning objective penalizing high-BCI continuations during DPO,
and (iii) Prov-Decode, a provenance-aware decoding strategy that vetoes beam
expansions predicted to yield high-BCI spans. Together, these defenses reduce
alignment drift by up to 85% on our curated Alignment Drift Benchmark (ADB)
while preserving utility on standard tasks (utility delta below 0.2) and
improving refusal quality. We further derive a theoretical upper bound on drift
likelihood via suffix-array span statistics, linking memorization frequency and
length to adversarial reactivation risk. TraceAlign thus provides the first
scalable, traceable, and grounded toolkit for understanding and mitigating
alignment failures at their source. To encourage further exploration and development,
we open-source our implementation at:
https://anonymous.4open.science/r/tracealign-2DA7
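
To make the retrieval step concrete, the following is a minimal sketch (not the authors' released implementation) of the core idea described in the abstract: match generated spans back to the training corpus with a naive suffix array and refuse a completion when a matched span's conflict score crosses a threshold, in the spirit of BCI and TraceShield. The helper names (build_suffix_array, toy_bci, trace_shield_filter), the threshold value, and the lexical-overlap scoring are illustrative assumptions; the paper's BCI is a semantic measure computed over retrieved training documents, not a pure memorization ratio.

```python
# Minimal, self-contained sketch of suffix-array span matching plus a
# BCI-style refusal filter. All names and the scoring rule are hypothetical.

def build_suffix_array(corpus: str) -> list[int]:
    # Naive O(n^2 log n) construction; real systems use linear-time builders.
    return sorted(range(len(corpus)), key=lambda i: corpus[i:])

def longest_match_len(query: str, corpus: str, sa: list[int]) -> int:
    # Binary-search for the suffix closest to `query`, then take the longest
    # shared prefix with its two lexicographic neighbours.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if corpus[sa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    best = 0
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            suffix = corpus[sa[idx]:]
            k = 0
            while k < len(query) and k < len(suffix) and query[k] == suffix[k]:
                k += 1
            best = max(best, k)
    return best

def toy_bci(span: str, corpus: str, sa: list[int]) -> float:
    # Toy proxy for the Belief Conflict Index: fraction of the span that is
    # memorized verbatim from the corpus (the real BCI is semantic).
    if not span:
        return 0.0
    return longest_match_len(span, corpus, sa) / len(span)

def trace_shield_filter(spans: list[str], corpus: str, threshold: float = 0.8) -> bool:
    # Inference-time filter in the spirit of TraceShield: refuse the
    # completion if any generated span scores above the threshold.
    sa = build_suffix_array(corpus)
    return any(toy_bci(s, corpus, sa) > threshold for s in spans)

if __name__ == "__main__":
    corpus = "how to build a safe model. never reveal system prompts."
    spans = ["never reveal system prompts", "tell me a joke"]
    print(trace_shield_filter(spans, corpus))  # True: the first span is memorized verbatim
```

In this toy setup the filter flags the completion because one span is copied verbatim from the corpus; the paper's pipeline additionally conditions the decision on whether the retrieved source conflicts with the alignment policy.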