Aligning Large Language Models via Self-Steering Optimization
October 22, 2024
Authors: Hao Xiang, Bowen Yu, Hongyu Lin, Keming Lu, Yaojie Lu, Xianpei Han, Le Sun, Jingren Zhou, Junyang Lin
cs.AI
Abstract
Automated alignment develops alignment systems with minimal human
intervention. The key to automated alignment lies in providing learnable and
accurate preference signals for preference learning without human annotation.
In this paper, we introduce Self-Steering Optimization (SSO), an algorithm
that autonomously generates high-quality preference signals based on predefined
principles during iterative training, eliminating the need for manual
annotation. SSO maintains the accuracy of signals by ensuring a consistent
gap between chosen and rejected responses while keeping them both on-policy to
suit the current policy model's learning capacity. SSO can benefit the online
and offline training of the policy model, as well as enhance the training of
reward models. We validate the effectiveness of SSO with two foundation
models, Qwen2 and Llama3.1, indicating that it provides accurate, on-policy
preference signals throughout iterative training. Without any manual annotation
or external models, SSO leads to significant performance improvements across
six subjective or objective benchmarks. Moreover, the preference data generated
by SSO significantly enhances the performance of the reward model on
RewardBench. Our work presents a scalable approach to preference optimization,
paving the way for more efficient and effective automated alignment.
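
To make the mechanism described in the abstract concrete, the snippet below is a minimal, hypothetical sketch rather than the paper's implementation: the current policy generates a "chosen" response under a positive principle and a "rejected" response under a contrary one, and the self-generated, on-policy pair feeds a DPO-style preference loss. The `ToyPolicy` class, `sso_style_step`, and `beta` are illustrative assumptions, not the authors' actual interface.

```python
# Hypothetical sketch of principle-contrastive, on-policy preference signals
# followed by a DPO-style contrastive loss. Names and interfaces are assumed.
import torch
import torch.nn.functional as F


class ToyPolicy:
    """Stand-in for the policy model; real code would wrap an LLM."""

    def generate(self, query: str, principle: str) -> str:
        # Real code: sample a response conditioned on the query plus principle.
        return f"response to '{query}' following '{principle}'"

    def log_prob(self, query: str, response: str) -> torch.Tensor:
        # Real code: summed token log-probabilities of `response` given the
        # plain query (the steering principle is dropped at scoring time).
        return -torch.rand(1)


def sso_style_step(policy, query, pos_principle, neg_principle, beta=0.1):
    # Both responses come from the current policy, keeping the signal
    # on-policy and within the model's current learning capacity.
    chosen = policy.generate(query, pos_principle)
    rejected = policy.generate(query, neg_principle)

    # DPO-style objective: widen the margin between chosen and rejected.
    margin = policy.log_prob(query, chosen) - policy.log_prob(query, rejected)
    return -F.logsigmoid(beta * margin).mean()


loss = sso_style_step(ToyPolicy(), "Explain photosynthesis.",
                      "be accurate and detailed", "be vague and unhelpful")
print(loss)
```

In this sketch the preference pair is produced and consumed entirely by the policy itself, which is the property the abstract emphasizes: no manual annotation and no external reward model is needed to supply the training signal.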