Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding
June 9, 2025
Authors: Feifan Song, Shaohang Wei, Wen Luo, Yuxuan Fan, Tianyu Liu, Guoyin Wang, Houfeng Wang
cs.AI
Abstract
Large Language Models (LLMs) require alignment with human preferences to avoid generating offensive, false, or meaningless content. Low-resource alignment methods have recently attracted attention, but they still struggle to produce content that is both high-quality and well aligned. Motivated by the observation that the difficulty of generating aligned responses is concentrated at the beginning of decoding, we propose a novel framework, Weak-to-Strong Decoding (WSD), which enhances the alignment ability of base models under the guidance of a small aligned model. The small model first drafts a well-aligned beginning; the large base model then continues the rest, with a well-designed auto-switch mechanism controlling the handover. We also collect a new dataset, GenerAlign, and use it to fine-tune a small model, Pilot-3B, as the draft model. Under the WSD framework, Pilot-3B effectively enhances different base models, outperforming all baseline methods while avoiding the degradation on downstream tasks known as the alignment tax. We further conduct extensive experiments examining the impact of different settings and time efficiency, along with in-depth analyses of the intrinsic mechanisms of WSD.
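
To make the decoding flow concrete, below is a minimal sketch of the WSD handover loop. The abstract does not specify the actual auto-switch criterion, so the confidence-based switch, the shared tokenizer, the greedy decoding, and the `conf_threshold` parameter are all illustrative assumptions rather than the paper's method; in the paper's setup, the draft model would be Pilot-3B and the base model a larger unaligned LLM.

```python
# Minimal sketch of Weak-to-Strong Decoding (WSD), per the abstract's description.
# The switch criterion (draft-model token confidence) is an assumption for
# illustration; the paper's actual auto-switch mechanism may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def wsd_generate(prompt, draft_name, base_name, conf_threshold=0.5, max_new_tokens=256):
    """A small aligned model drafts the beginning of the response;
    a large base model takes over to finish it."""
    tok = AutoTokenizer.from_pretrained(draft_name)  # assumes both models share a tokenizer
    draft = AutoModelForCausalLM.from_pretrained(draft_name)
    base = AutoModelForCausalLM.from_pretrained(base_name)

    ids = tok(prompt, return_tensors="pt").input_ids
    model = draft  # start with the small aligned draft model
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[:, -1, :]
        probs = torch.softmax(logits, dim=-1)
        next_id = probs.argmax(dim=-1, keepdim=True)  # greedy decoding for simplicity
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
        # Hypothetical auto-switch: once the draft model's top-token confidence
        # drops below the threshold, hand over to the base model permanently.
        if model is draft and probs.max().item() < conf_threshold:
            model = base
    return tok.decode(ids[0], skip_special_tokens=True)
```

The one-way handover mirrors the paper's intuition: alignment difficulty is concentrated at the beginning of decoding, so once a well-aligned opening is in the context, the stronger base model can safely complete the rest.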