Aligning Large Language Models via Self-Steering Optimization

Abstract

Automated alignment develops alignment systems with minimal human intervention. The key to automated alignment lies in providing learnable and accurate preference signals for preference learning without human annotation. In this paper, we introduce Self-Steering Optimization (SSO), an algorithm that autonomously generates high-quality preference signals based on predefined principles during iterative training, eliminating the need for manual annotation. SSO maintains signal accuracy by ensuring a consistent gap between chosen and rejected responses, while keeping both on-policy so that they suit the current policy model's learning capacity. SSO can benefit both online and offline training of the policy model and can also enhance the training of reward models. We validate SSO on two foundation models, Qwen2 and Llama3.1, and the results indicate that it provides accurate, on-policy preference signals throughout iterative training. Without any manual annotation or external models, SSO yields significant performance improvements across six subjective and objective benchmarks. In addition, the preference data generated by SSO significantly improves reward model performance on RewardBench. Our work presents a scalable approach to preference optimization, paving the way for more efficient and effective automated alignment.
