TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Abstract

Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift: they produce unsafe or policy-violating completions when exposed to adversarial prompts, decoding perturbations, or paraphrased jailbreaks. Prior work has characterized alignment failures behaviorally, but little is known about the training-time belief sources that underlie them. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model's training corpus. Central to our approach is the Belief Conflict Index (BCI), which quantifies the semantic inconsistency between generated spans and aligned policies, using training documents retrieved by suffix-array matching. We propose three complementary interventions: (i) TraceShield, an inference-time safety filter that refuses completions containing high-BCI spans; (ii) Contrastive Belief Deconfliction Loss, a contrastive fine-tuning objective that penalizes high-BCI continuations during DPO (Direct Preference Optimization); and (iii) Prov-Decode, a provenance-aware decoding strategy that vetoes beam expansions predicted to yield high-BCI spans. Together, these defenses reduce alignment drift by up to 85% on our curated Alignment Drift Benchmark (ADB) while preserving utility on standard tasks (delta below 0.2) and improving refusal quality. We further derive a theoretical upper bound on drift likelihood from suffix-array span statistics, linking memorization frequency and span length to adversarial reactivation risk. TraceAlign thus provides the first scalable, traceable, and grounded toolkit for understanding and mitigating alignment failures at their source. To encourage further exploration and development, we open-source our implementation at: https://anonymous.4open.science/r/tracealign-2DA7
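
As a rough illustration of the retrieval step the abstract mentions, the sketch below uses a naive suffix array to find spans of a completion that appear verbatim in a toy corpus. It is not the authors' implementation and does not compute BCI; the function names (build_suffix_array, longest_prefix_match, trace_spans), the character-level matching, and the min_len threshold are all assumptions made for this example.

```python
def build_suffix_array(corpus: str) -> list[int]:
    # Naive construction by sorting all suffixes; adequate only for toy inputs.
    return sorted(range(len(corpus)), key=lambda i: corpus[i:])


def _lcp(a: str, b: str) -> int:
    # Length of the longest common prefix of two strings.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_prefix_match(corpus: str, sa: list[int], query: str) -> tuple[int, int]:
    # Binary-search the insertion point of `query` among the sorted suffixes.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if corpus[sa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    # The suffix sharing the longest prefix with `query` is adjacent to that point.
    best_len, best_off = 0, -1
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            n = _lcp(corpus[sa[idx]:], query)
            if n > best_len:
                best_len, best_off = n, sa[idx]
    return best_len, best_off


def trace_spans(corpus: str, sa: list[int], completion: str, min_len: int = 5):
    # Greedy left-to-right scan: record (completion_start, length, corpus_offset)
    # for every match of at least `min_len` characters (hypothetical threshold).
    spans, i = [], 0
    while i < len(completion):
        length, offset = longest_prefix_match(corpus, sa, completion[i:])
        if length >= min_len:
            spans.append((i, length, offset))
            i += length
        else:
            i += 1
    return spans


corpus = "the quick brown fox jumps over the lazy dog"
sa = build_suffix_array(corpus)
spans = trace_spans(corpus, sa, "a quick brown cat")
for start, length, offset in spans:
    print(start, length, repr(corpus[offset:offset + length]))
```

In a pipeline like the one described, the matched spans and their source documents would then be passed to a scoring function such as BCI; that scoring is left out here because the abstract does not define it.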
