An Empirical Study on Preference Tuning Generalization and Diversity Under Domain Shift
January 9, 2026
Authors: Constantinos Karouzos, Xingwei Tan, Nikolaos Aletras
cs.AI
Abstract
Preference tuning aligns pretrained language models with human judgments of quality, helpfulness, or safety by optimizing over explicit preference signals rather than likelihood alone. Prior work has shown that preference tuning degrades performance and reduces helpfulness when models are evaluated outside the training domain. However, the extent to which adaptation strategies can mitigate this domain shift remains unexplored. We address this challenge by conducting a comprehensive and systematic study of alignment generalization under domain shift. We compare five popular alignment objectives and several adaptation strategies from a source to a target domain, including target-domain supervised fine-tuning and pseudo-labeling, across summarization and question-answering helpfulness tasks. Our findings reveal systematic differences in generalization across alignment objectives under domain shift. We show that adaptation strategies based on pseudo-labeling can substantially reduce domain-shift degradation.
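
To make the pseudo-labeling adaptation strategy concrete, below is a minimal sketch of one common way it can be instantiated: a reward model trained on source-domain preferences scores candidate responses sampled on unlabeled target-domain prompts, and the score ordering yields pseudo-preference pairs that can then be fed to any preference objective (e.g. DPO). The function names `generate_candidates` and `source_reward`, the margin filter, and the data layout are illustrative assumptions, not the paper's implementation.

```python
# Sketch of pseudo-labeling for preference-tuning adaptation (illustrative, not the paper's code).
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the source-domain reward model prefers
    rejected: str  # response it scores lower


def pseudo_label(
    target_prompts: List[str],
    generate_candidates: Callable[[str], Tuple[str, str]],  # samples two responses per prompt
    source_reward: Callable[[str, str], float],             # reward model trained on the source domain
    min_margin: float = 0.0,
) -> List[PreferencePair]:
    """Build pseudo-preference pairs on target-domain prompts.

    Pairs whose reward margin is below `min_margin` are discarded,
    since near-ties give a noisy preference signal.
    """
    pairs: List[PreferencePair] = []
    for prompt in target_prompts:
        a, b = generate_candidates(prompt)
        ra, rb = source_reward(prompt, a), source_reward(prompt, b)
        if abs(ra - rb) < min_margin:
            continue  # skip ambiguous pairs
        chosen, rejected = (a, b) if ra > rb else (b, a)
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs
```

The resulting pairs play the same role as human-labeled preferences in the target domain, so any of the alignment objectives compared in the study could, in principle, be trained on them without modification.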