Controlled Decoding from Language Models

Abstract

We propose controlled decoding (CD), a novel off-policy reinforcement learning method for steering autoregressive generation from language models toward high-reward outcomes. CD solves an off-policy reinforcement learning problem through a value function for the reward, which we call a prefix scorer. The prefix scorer is used at inference time to steer generation toward higher-reward outcomes. We show that the prefix scorer can be trained on (possibly) off-policy data to predict the expected reward when decoding is continued from a partially decoded response. We empirically demonstrate that CD is effective as a control mechanism on the Reddit conversations corpus. We also show that the modular design of CD makes it possible to control for multiple rewards, effectively solving a multi-objective reinforcement learning problem with no additional complexity. Finally, we show that CD can be applied in a novel blockwise fashion at inference time, again without any training-time changes, essentially bridging the gap between the popular best-of-K strategy and token-level reinforcement learning. This makes CD a promising approach for the alignment of language models.
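
The blockwise use of a prefix scorer described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`blockwise_decode`, `toy_sample_block`, `toy_prefix_scorer`), the integer-token representation, and the rule of keeping the single highest-scoring block out of K samples are all assumptions made for illustration, and the toy sampler and scorer merely stand in for a real language model and a trained prefix scorer.

```python
import random
from typing import Callable, List

Token = int  # assumption: tokens are plain integers in this toy setup


def blockwise_decode(
    sample_block: Callable[[List[Token], int, random.Random], List[Token]],
    prefix_scorer: Callable[[List[Token]], float],
    prompt: List[Token],
    num_blocks: int,
    block_size: int,
    k: int,
    seed: int = 0,
) -> List[Token]:
    """Extend `prompt` block by block, keeping the best of k sampled blocks.

    Hypothetical sketch: at each step, k candidate blocks are drawn from the
    base sampler, each extended prefix is scored by the prefix scorer, and
    the highest-scoring block is appended.
    """
    rng = random.Random(seed)
    prefix = list(prompt)
    for _ in range(num_blocks):
        # Draw k candidate continuations of the current prefix.
        candidates = [sample_block(prefix, block_size, rng) for _ in range(k)]
        # Score each candidate by the predicted reward of the extended prefix.
        best = max(candidates, key=lambda block: prefix_scorer(prefix + block))
        prefix.extend(best)
    return prefix


def toy_sample_block(
    prefix: List[Token], block_size: int, rng: random.Random
) -> List[Token]:
    """Stand-in for a base language model: uniform random tokens in [0, 10)."""
    return [rng.randrange(10) for _ in range(block_size)]


def toy_prefix_scorer(prefix: List[Token]) -> float:
    """Stand-in for a trained prefix scorer: larger token sums score higher."""
    return float(sum(prefix))
```

In this sketch, a single block that spans the whole response reduces to best-of-K sampling, while a block size of one token gives control at every token, which is one way to read the bridging claim in the abstract.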
