Iterative Value Function Optimization for Guided Decoding

Abstract

While Reinforcement Learning from Human Feedback (RLHF) has become the predominant method for controlling language model outputs, it suffers from high computational costs and training instability. Guided decoding, especially value-guided decoding, offers a cost-effective alternative because it controls outputs without re-training the model. However, value-guided decoding depends critically on the accuracy of the value function: an inaccurate value function leads to suboptimal decisions during decoding and degraded performance. Existing methods struggle to estimate the optimal value function accurately, which makes their control less effective. We propose Iterative Value Function Optimization, a novel framework that addresses these limitations through two key components: Monte Carlo Value Estimation, which reduces estimation variance by exploring diverse trajectories, and Iterative On-Policy Optimization, which progressively improves value estimation by collecting trajectories from value-guided policies. Extensive experiments on text summarization, multi-turn dialogue, and instruction following demonstrate the effectiveness of value-guided decoding approaches in aligning language models. These approaches not only achieve alignment but also significantly reduce computational costs by leveraging principled value function optimization for efficient and effective control.
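As a rough illustration of how the two components fit together, the following is a minimal toy sketch and not the implementation evaluated in the experiments. It assumes a two-token vocabulary, a fixed sequence length, a uniform base policy, a hypothetical reward (the fraction of tokens equal to 1), a tabular value function in place of a learned one, and a guided policy of the form base policy times exp(beta * V). All names (`reward`, `guided_probs`, `rollout`, `monte_carlo_values`, `iterate`) are illustrative.

```python
import math
import random
from collections import defaultdict

VOCAB = (0, 1)  # toy vocabulary (assumption)
LENGTH = 4      # fixed sequence length (assumption)


def reward(seq):
    """Hypothetical reward: fraction of tokens equal to 1."""
    return sum(seq) / len(seq)


def guided_probs(prefix, value, beta):
    """Reweight a uniform base policy by exp(beta * V(prefix + token)).

    Prefixes missing from the value table default to a value of 0.0.
    """
    weights = [math.exp(beta * value.get(prefix + (tok,), 0.0)) for tok in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]


def rollout(value, beta, rng):
    """Sample one full sequence from the value-guided policy."""
    seq = ()
    while len(seq) < LENGTH:
        probs = guided_probs(seq, value, beta)
        seq += (rng.choices(VOCAB, weights=probs)[0],)
    return seq


def monte_carlo_values(value, beta, num_rollouts, rng):
    """Estimate V(prefix) as the mean reward of sampled rollouts through it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(num_rollouts):
        seq = rollout(value, beta, rng)
        r = reward(seq)
        for t in range(1, LENGTH + 1):
            totals[seq[:t]] += r
            counts[seq[:t]] += 1
    return {prefix: totals[prefix] / counts[prefix] for prefix in totals}


def iterate(num_iters=3, num_rollouts=2000, beta=2.0, seed=0):
    """Alternate between on-policy data collection and value re-estimation."""
    rng = random.Random(seed)
    value = {}  # empty table: the first round samples from the base policy
    for _ in range(num_iters):
        value = monte_carlo_values(value, beta, num_rollouts, rng)
    return value
```

In this sketch, each round collects trajectories from the current value-guided policy and re-estimates the value table from them, so later rounds estimate values under a policy closer to the one actually used at decoding time. Averaging over many rollouts per prefix stands in for the variance reduction that Monte Carlo Value Estimation targets.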
