

PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment

October 17, 2024
Authors: Zekun Moore Wang, Shawn Wang, Kang Zhu, Jiaheng Liu, Ke Xu, Jie Fu, Wangchunshu Zhou, Wenhao Huang
cs.AI

Abstract

Alignment of large language models (LLMs) involves training models on preference-contrastive output pairs to adjust their responses according to human preferences. To obtain such contrastive pairs, traditional methods like RLHF and RLAIF rely on limited contrasting patterns, such as varying model variants or decoding temperatures. This singularity leads to two issues: (1) alignment is not comprehensive; and thereby (2) models are susceptible to jailbreaking attacks. To address these issues, we investigate how to construct more comprehensive and diversified contrasting patterns to enhance preference data (RQ1) and verify the impact of the diversification of contrasting patterns on model alignment (RQ2). For RQ1, we propose PopAlign, a framework that integrates diversified contrasting patterns across the prompt, model, and pipeline levels, introducing six contrasting strategies that do not require additional feedback labeling procedures. Regarding RQ2, we conduct thorough experiments demonstrating that PopAlign significantly outperforms existing methods, leading to more comprehensive alignment.
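To make the idea of "contrasting patterns" concrete, the sketch below shows how preference pairs might be constructed by varying the generation condition at different levels, e.g. a guided vs. unguided system prompt (prompt level) or conservative vs. high-temperature decoding (pipeline level). This is a minimal illustration, not the paper's actual implementation; `generate` is a hypothetical stand-in for a real LLM call, and all function names are invented for this example.

```python
def generate(prompt, system="", temperature=0.7):
    """Hypothetical stand-in for an LLM sampling call.

    A real implementation would query a model; here we just return a
    tagged string so the pipeline structure is visible and testable.
    """
    return f"[system={system!r}, T={temperature}] response to: {prompt}"

def prompt_contrast(prompt):
    """Prompt-level contrast: guided vs. unguided system prompt."""
    chosen = generate(prompt, system="Be helpful and harmless.")
    rejected = generate(prompt, system="")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

def decoding_contrast(prompt):
    """Pipeline-level contrast: conservative vs. high-temperature decoding."""
    chosen = generate(prompt, temperature=0.2)
    rejected = generate(prompt, temperature=1.2)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Diversifying the contrasting patterns means drawing pairs from several
# such strategies instead of a single one; no extra feedback labeling is
# needed because the "chosen" side is fixed by construction.
strategies = [prompt_contrast, decoding_contrast]
pairs = [strategy("How do I secure my server?") for strategy in strategies]
```

The resulting `pairs` list uses the common `{"prompt", "chosen", "rejected"}` layout that preference-optimization trainers typically consume; PopAlign's six strategies would each contribute pairs of this shape.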

