Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding

Abstract

Large Language Models (LLMs) require alignment with human preferences to avoid generating offensive, false, or meaningless content. Low-resource methods for LLM alignment have recently become popular, but they still struggle to produce content that is both high-quality and aligned. Motivated by the observation that the difficulty of generating aligned responses is concentrated at the beginning of decoding, we propose a novel framework, Weak-to-Strong Decoding (WSD), which enhances the alignment ability of base models under the guidance of a small aligned model. The small model first drafts a well-aligned beginning, and the large base model then continues the rest of the response; the handover is controlled by a well-designed auto-switch mechanism. We also collect a new dataset, GenerAlign, to fine-tune a small Pilot-3B model as the draft model. Under the WSD framework, Pilot-3B effectively enhances different base models so that they outperform all baseline methods, while avoiding degradation on downstream tasks, known as the alignment tax. We further conduct extensive experiments to examine the impact of different settings and time efficiency, and analyze the intrinsic mechanisms of WSD in depth.
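
The draft-then-continue control flow described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the auto-switch criterion, so the rule used here (hand over once the base model's average probability for the last few drafted tokens reaches a threshold) is an assumption made purely for illustration. The function name, the callable interfaces, and the `window` and `threshold` parameters are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interfaces for illustration only (not from the paper):
NextToken = Callable[[Sequence[str]], str]         # context -> next token
TokenProb = Callable[[Sequence[str], str], float]  # (context, token) -> probability


def weak_to_strong_decode(
    prompt: Sequence[str],
    draft_next: NextToken,   # small aligned model
    base_next: NextToken,    # large base model
    base_prob: TokenProb,    # base model's probability for a drafted token
    window: int = 3,         # assumed: number of recent draft tokens to average over
    threshold: float = 0.5,  # assumed: confidence level that triggers the switch
    max_tokens: int = 20,
    eos: str = "<eos>",
) -> Tuple[List[str], Optional[int]]:
    """Return (generated tokens, number of drafted tokens before the switch).

    The second element is None if the base model never took over.
    """
    tokens = list(prompt)
    recent: List[float] = []
    switched_at: Optional[int] = None

    for _ in range(max_tokens):
        if switched_at is None:
            # Phase 1: the small model drafts; the base model only scores.
            tok = draft_next(tokens)
            recent.append(base_prob(tokens, tok))
            recent = recent[-window:]
            tokens.append(tok)
            if tok == eos:
                break
            # Assumed switch rule: the base model is, on average,
            # confident enough about the last `window` drafted tokens.
            if len(recent) == window and sum(recent) / window >= threshold:
                switched_at = len(tokens) - len(prompt)
        else:
            # Phase 2: the base model continues from the drafted beginning.
            tok = base_next(tokens)
            tokens.append(tok)
            if tok == eos:
                break

    return tokens[len(prompt):], switched_at
```

In a real setting the three callables would wrap actual language models; here they are left abstract so that the switching logic can be read and tested in isolation with toy stand-ins.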

