Value Drifts: Tracing Value Alignment During LLM Post-Training
October 30, 2025
Authors: Mehar Bhatia, Shravan Nayak, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, Vered Shwartz, Siva Reddy
cs.AI
Abstract
As LLMs occupy an increasingly important role in society, they are more and
more confronted with questions that require them not only to draw on their
general knowledge but also to align with certain human value systems.
Therefore, studying the alignment of LLMs with human values has become a
crucial field of inquiry. Prior work, however, mostly focuses on evaluating the
alignment of fully trained models, overlooking the training dynamics by which
models learn to express human values. In this work, we investigate how and at
which stage value alignment arises during the course of a model's
post-training. Our analysis disentangles the effects of post-training
algorithms and datasets, measuring both the magnitude and timing of value drifts
during training. Experimenting with Llama-3 and Qwen-3 models of different
sizes and popular supervised fine-tuning (SFT) and preference optimization
datasets and algorithms, we find that the SFT phase generally establishes a
model's values, and subsequent preference optimization rarely re-aligns these
values. Furthermore, using a synthetic preference dataset that enables
controlled manipulation of values, we find that different preference
optimization algorithms lead to different value alignment outcomes, even when
preference data is held constant. Our findings provide actionable insights into
how values are learned during post-training and help to inform data curation,
as well as the selection of models and algorithms for preference optimization
to improve model alignment with human values.
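
To make the notion of a value drift concrete, the sketch below shows one hypothetical way to quantify drift across post-training checkpoints: each checkpoint is scored along a few value dimensions, and drift is taken as the L1 distance between consecutive score vectors. This is not the paper's released code; the value dimensions, the placeholder scores, and the choice of L1 distance are all illustrative assumptions rather than details taken from the abstract.

```python
# Hypothetical sketch: quantify value drift between post-training checkpoints.
# Scores and value dimensions are made-up placeholders for illustration.

import numpy as np

# Rows = checkpoints (base -> after SFT -> after preference optimization),
# columns = value dimensions (e.g., care, fairness, authority); each entry is
# a score in [0, 1] for how strongly the model expresses that value.
checkpoint_scores = np.array([
    [0.40, 0.55, 0.50],  # base model
    [0.62, 0.70, 0.45],  # after SFT
    [0.63, 0.71, 0.44],  # after preference optimization (e.g., DPO)
])

# Drift magnitude: L1 distance between consecutive checkpoints.
drifts = np.abs(np.diff(checkpoint_scores, axis=0)).sum(axis=1)

for step, drift in enumerate(drifts, start=1):
    print(f"checkpoint {step - 1} -> {step}: drift = {drift:.3f}")

# Timing of the largest drift; under the abstract's finding, this would
# typically fall in the SFT phase rather than during preference optimization.
largest = int(np.argmax(drifts))
print(f"largest drift occurs between checkpoints {largest} and {largest + 1}")
```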