Value Drifts: Tracing Value Alignment During LLM Post-Training
October 30, 2025
Authors: Mehar Bhatia, Shravan Nayak, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, Vered Shwartz, Siva Reddy
cs.AI
Abstract
As LLMs take on an increasingly important role in society, they are more and more often confronted with questions that require them not only to draw on their general knowledge but also to align with particular human value systems. Studying the alignment of LLMs with human values has therefore become a crucial field of inquiry. Prior work, however, has mostly focused on evaluating the alignment of fully trained models, overlooking the training dynamics through which models learn to express human values. In this work, we investigate how, and at which stage of post-training, value alignment arises. Our analysis disentangles the effects of post-training algorithms and datasets, measuring both the magnitude and timing of value drifts during training. Experimenting with Llama-3 and Qwen-3 models of different sizes and with popular supervised fine-tuning (SFT) and preference optimization datasets and algorithms, we find that the SFT phase generally establishes a model's values, and that subsequent preference optimization rarely re-aligns them. Furthermore, using a synthetic preference dataset that enables controlled manipulation of values, we find that different preference optimization algorithms lead to different value alignment outcomes, even when the preference data is held constant. Our findings provide actionable insight into how values are learned during post-training and help inform data curation, as well as the selection of models and algorithms for preference optimization, to improve model alignment with human values.
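
To make concrete what "measuring the magnitude and timing of value drifts" could look like in practice, the sketch below shows one simple way to quantify drift across post-training checkpoints. This is an illustrative approximation, not the paper's implementation: the value_drift helper, the score matrix, and the value dimensions are hypothetical, and the sketch assumes each checkpoint has already been scored on a fixed set of value dimensions (for example, via value-probing prompts).

# Minimal sketch (assumed, not the authors' code): quantify value drift
# across post-training checkpoints given per-checkpoint value scores.
import numpy as np

def value_drift(checkpoint_scores: np.ndarray):
    """checkpoint_scores: shape (num_checkpoints, num_value_dims); row t holds
    the model's value-expression scores at checkpoint t."""
    # Per-step drift: distance between consecutive checkpoints' value profiles.
    step_drift = np.linalg.norm(np.diff(checkpoint_scores, axis=0), axis=1)
    # Magnitude: total drift accumulated over the course of training.
    total_drift = step_drift.sum()
    # Timing: the checkpoint at which the largest single drift occurs.
    peak_checkpoint = int(step_drift.argmax()) + 1
    return step_drift, total_drift, peak_checkpoint

# Hypothetical usage: 5 checkpoints scored on 3 value dimensions.
scores = np.array([
    [0.20, 0.50, 0.30],  # base model
    [0.60, 0.40, 0.70],  # early SFT -- large drift
    [0.62, 0.41, 0.71],  # late SFT
    [0.63, 0.40, 0.72],  # after preference optimization -- little further drift
    [0.63, 0.40, 0.72],
])
per_step, total, peak = value_drift(scores)
print(per_step, total, peak)

Under this toy measure, most of the drift is concentrated at the SFT checkpoint, consistent with the abstract's claim that SFT generally establishes a model's values while preference optimization rarely re-aligns them.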