Direct Preference Optimization Using Sparse Feature-Level Constraints

November 12, 2024
Authors: Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang
cs.AI

Abstract

Aligning large language models (LLMs) with human preferences remains a key challenge. While post-training techniques such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiencies and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach gains efficiency by using the sparse features activated in a well-trained sparse autoencoder, and retains the quality of sequential KL divergence by using a feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate at much lower computational cost than state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignment.
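
To make the idea of a feature-level constraint concrete, the following is a minimal sketch and not the paper's actual objective, which the abstract does not spell out. It pairs the standard per-example DPO loss with a hypothetical penalty that compares SAE feature activations of the policy against precomputed (offline) reference activations, restricted to the k most active features. The function names, the top-k selection rule, the squared-error form of the penalty, and the weights `alpha` and `k` are all assumptions made for illustration.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen (w) and rejected (l)
    responses under the policy and the reference model.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)


def sparse_feature_penalty(policy_feats: list[float], ref_feats: list[float],
                           k: int) -> float:
    """Hypothetical constraint: mean squared difference between policy and
    offline reference SAE activations, over the k features most active in
    the policy (an assumption, not taken from the paper)."""
    top = sorted(range(len(policy_feats)),
                 key=lambda i: policy_feats[i], reverse=True)[:k]
    return sum((policy_feats[i] - ref_feats[i]) ** 2 for i in top) / k


def combined_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
                  policy_feats: list[float], ref_feats: list[float],
                  beta: float = 0.1, alpha: float = 1.0, k: int = 2) -> float:
    """Illustrative total: preference loss plus weighted feature penalty."""
    preference = dpo_loss(pol_w, pol_l, ref_w, ref_l, beta)
    constraint = sparse_feature_penalty(policy_feats, ref_feats, k)
    return preference + alpha * constraint


if __name__ == "__main__":
    # Toy numbers only; real activations would come from a pre-trained SAE.
    loss = combined_loss(-1.0, -3.0, -2.0, -2.0,
                         policy_feats=[0.0, 2.0, 0.0, 1.0],
                         ref_feats=[0.0, 1.0, 0.0, 1.0],
                         beta=1.0, alpha=1.0, k=2)
    print(f"combined loss: {loss:.4f}")
```

Because the reference activations are passed in as plain values, they can be computed once and cached, so no reference model has to run during training in this sketch. Restricting the penalty to k features keeps its cost independent of the SAE dictionary size.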
November 14, 2024