BOND: Aligning LLMs with Best-of-N Distillation

Abstract

Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling that selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models. Aligning Gemma policies with BOND outperforms other RLHF algorithms by improving results on several benchmarks.
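The two ingredients named in the abstract, the Best-of-N distribution and a Jeffreys-style mix of forward and backward KL, can be illustrated on a toy finite output space. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the function names, the weight `beta`, the convention for which KL term `beta` multiplies, and the assumption that all rewards are distinct are introduced here for the example.

```python
import numpy as np


def best_of_n_distribution(probs, rewards, n):
    """Exact Best-of-N distribution over a finite set of outputs.

    Assumes (for this sketch) that all rewards are distinct, so the
    best of N i.i.d. samples is unique. Then
    P(best = y) = P(reward <= r(y))^N - P(reward < r(y))^N.
    """
    probs = np.asarray(probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    order = np.argsort(rewards)        # indices sorted by ascending reward
    cdf = np.cumsum(probs[order])      # P(reward <= r_i) under the reference
    cdf_prev = cdf - probs[order]      # P(reward <  r_i) under the reference
    bon_sorted = cdf ** n - cdf_prev ** n
    bon = np.empty_like(probs)
    bon[order] = bon_sorted            # undo the sort
    return bon


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def jeffreys(policy, target, beta=0.5):
    """Weighted mix of forward KL(target || policy) and backward KL(policy || target).

    The weight `beta` and which term it multiplies are assumptions of this sketch.
    """
    return (1.0 - beta) * kl(target, policy) + beta * kl(policy, target)


# Toy reference policy over three outputs, with higher reward for later outputs.
reference = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]

target = best_of_n_distribution(reference, rewards, n=2)
print(target)                       # approximately [0.25, 0.39, 0.36]
print(jeffreys(reference, target))  # positive: the reference is not yet Best-of-2
print(jeffreys(target, target))     # 0.0: a policy equal to the target matches it
```

In this toy setting, Best-of-2 moves probability mass from the lowest-reward output toward the higher-reward ones, and the divergence drops to zero only when the policy reproduces that shifted distribution.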
