Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding
June 9, 2025
Authors: Feifan Song, Shaohang Wei, Wen Luo, Yuxuan Fan, Tianyu Liu, Guoyin Wang, Houfeng Wang
cs.AI
Abstract
Large Language Models (LLMs) require alignment with human preferences to avoid generating offensive, false, or meaningless content. Recently, low-resource methods for LLM alignment have gained popularity, yet they still struggle to produce content that is both high-quality and aligned. Motivated by the observation that the difficulty of generating aligned responses is concentrated at the beginning of decoding, we propose a novel framework, Weak-to-Strong Decoding (WSD), to enhance the alignment ability of base models under the guidance of a small aligned model. The small model first drafts a well-aligned beginning, after which the large base model continues the rest, with the handoff controlled by a well-designed auto-switch mechanism. We also collect a new dataset, GenerAlign, to fine-tune a small-sized Pilot-3B as the draft model, which effectively enhances different base models under the WSD framework, outperforming all baseline methods while avoiding degradation on downstream tasks (the so-called alignment tax). We further conduct extensive experiments to examine the impact of different settings and time efficiency, along with in-depth analyses of the intrinsic mechanisms of WSD.
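
As a rough illustration of the framework described above, the sketch below shows how weak-to-strong decoding could be wired up with the Hugging Face transformers library: a small aligned draft model generates the opening tokens, and a switch rule hands the prefix to the large base model to finish the response. The model paths, the fixed prefix budget, and the confidence-based switch rule are all illustrative assumptions on my part; the abstract names an auto-switch mechanism but does not specify its criterion.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical paths: a small aligned draft model (e.g. the paper's
# Pilot-3B) and a large unaligned base model.
DRAFT_PATH = "path/to/Pilot-3B"
BASE_PATH = "path/to/base-model"

tok = AutoTokenizer.from_pretrained(BASE_PATH)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_PATH)
base = AutoModelForCausalLM.from_pretrained(BASE_PATH)

@torch.no_grad()
def wsd_generate(prompt, min_prefix=16, max_prefix=64, max_new=256,
                 conf_threshold=0.9):
    ids = tok(prompt, return_tensors="pt").input_ids
    prompt_len = ids.shape[-1]
    # Phase 1: the small aligned model drafts the beginning, greedily,
    # one token at a time.
    for _ in range(max_prefix):
        probs = torch.softmax(draft(ids).logits[0, -1], dim=-1)
        next_id = torch.argmax(probs)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tok.eos_token_id:
            return tok.decode(ids[0], skip_special_tokens=True)
        # Illustrative auto-switch stand-in: once a minimum prefix is
        # drafted and the draft model is highly confident, assume the
        # response is "well begun" and hand off to the base model.
        drafted = ids.shape[-1] - prompt_len
        if drafted >= min_prefix and probs[next_id].item() >= conf_threshold:
            break
    # Phase 2: the large base model continues the rest of the response.
    out = base.generate(ids, max_new_tokens=max_new,
                        pad_token_id=tok.eos_token_id)
    return tok.decode(out[0], skip_special_tokens=True)

print(wsd_generate("How should I respond politely to harsh criticism?"))

In this setup the base model never needs preference fine-tuning: the aligned prefix alone steers its continuation, which is the low-resource appeal of the approach.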