TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Abstract

Large language models (LLMs) fine-tuned to align with human values often exhibit alignment drift: they produce unsafe or policy-violating completions when exposed to adversarial prompts, decoding perturbations, or paraphrased jailbreaks. Prior work has characterized alignment failure behaviorally, but little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model's training corpus. Central to our approach is the Belief Conflict Index (BCI), which quantifies the semantic inconsistency between generated spans and aligned policies, using training documents retrieved by suffix-array matching. We propose three complementary interventions: (i) TraceShield, an inference-time safety filter that refuses completions containing high-BCI spans; (ii) Contrastive Belief Deconfliction Loss, a contrastive fine-tuning objective that penalizes high-BCI continuations during DPO; and (iii) Prov-Decode, a provenance-aware decoding strategy that vetoes beam expansions predicted to yield high-BCI spans. Together, these defenses reduce alignment drift by up to 85% on our curated Alignment Drift Benchmark (ADB) while preserving utility on standard tasks (delta less than 0.2) and improving refusal quality. We further derive a theoretical upper bound on drift likelihood from suffix-array span statistics, linking memorization frequency and span length to adversarial reactivation risk. TraceAlign thus provides the first scalable, traceable, and grounded toolkit for understanding and mitigating alignment failures at their source. To encourage further exploration and development, we open-source our implementation at: https://anonymous.4open.science/r/tracealign-2DA7
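
The abstract does not give the BCI formula or the TraceShield decision rule, so the following is only a minimal sketch of the general idea: use a suffix array to find spans of a completion that also appear verbatim in training text, then refuse when such a span is found. Everything here is an assumption made for illustration: the function names (`build_suffix_array`, `find_range`, `matched_spans`, `should_refuse`), the whitespace tokenization, the `min_len` threshold, and the toy "flagged" corpus. The refusal rule is a simple span-overlap stand-in, not the paper's BCI.

```python
def build_suffix_array(tokens):
    """Start indices of all suffixes of `tokens`, in lexicographic order.
    Naive construction; fine for a toy corpus, too slow for a real one."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_range(tokens, sa, span):
    """Return (lo, hi) such that sa[lo:hi] are the suffixes starting with `span`."""
    n = len(span)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-n prefix is >= span
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] < span:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose length-n prefix is > span
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] <= span:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def matched_spans(completion, tokens, sa, min_len=4):
    """Greedy left-to-right scan: longest corpus match at each position.
    Returns (span, corpus_count) pairs for matches of at least `min_len` tokens."""
    spans, i = [], 0
    while i < len(completion):
        best, count, n = 0, 0, 1
        while i + n <= len(completion):
            lo, hi = find_range(tokens, sa, completion[i:i + n])
            if lo == hi:
                break
            best, count = n, hi - lo
            n += 1
        if best >= min_len:
            spans.append((completion[i:i + best], count))
            i += best
        else:
            i += 1
    return spans


def should_refuse(completion, tokens, sa, min_len=4):
    """Hypothetical stand-in for a TraceShield-style check (not the paper's BCI):
    refuse if any long span of the completion occurs in the flagged corpus."""
    return len(matched_spans(completion, tokens, sa, min_len)) > 0


# Toy usage: a one-sentence "flagged" training corpus (illustrative only).
flagged = "mix the two chemicals in a sealed jar and wait".split()
sa = build_suffix_array(flagged)
print(matched_spans("first mix the two chemicals in a bowl".split(), flagged, sa))
print(should_refuse("i cannot help with that request".split(), flagged, sa))
```

A production system would need a linear-time suffix-array builder and a real scoring function; the sketch only shows how span frequency and span length, the two statistics the abstract ties to reactivation risk, fall out of a suffix-array lookup.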
