Differential Information: An Information-Theoretic Perspective on Preference Optimization

Abstract

Direct Preference Optimization (DPO) has become a standard technique for aligning language models with human preferences in a supervised manner. Despite its empirical success, the theoretical justification behind its log-ratio reward parameterization remains incomplete. In this work, we address this gap by utilizing the Differential Information Distribution (DID): a distribution over token sequences that captures the information gained during policy updates. First, we show that when preference labels encode the differential information required to transform a reference policy into a target policy, the log-ratio reward in DPO emerges as the uniquely optimal form for learning the target policy via preference optimization. This result naturally yields a closed-form expression for the optimal sampling distribution over rejected responses. Second, we find that the condition for preferences to encode differential information is fundamentally linked to an implicit assumption regarding log-margin ordered policies, an inductive bias that is widely used in preference optimization yet previously unrecognized. Finally, by analyzing the entropy of the DID, we characterize how learning low-entropy differential information reinforces the policy distribution, while learning high-entropy differential information induces a smoothing effect, which explains the log-likelihood displacement phenomenon. We validate our theoretical findings in synthetic experiments and extend them to real-world instruction-following datasets. Our results suggest that learning high-entropy differential information is crucial for general instruction-following, while learning low-entropy differential information benefits knowledge-intensive question answering. Overall, our work presents a unifying perspective on the DPO objective, the structure of preference data, and the resulting policy behaviors through the lens of differential information.
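For readers unfamiliar with the log-ratio reward mentioned above, the following is a minimal sketch of the standard DPO formulation, not of the DID analysis itself. It assumes the sequence-level log-probabilities of each response under the policy and the reference policy are already available; the function names, argument names, and the default beta = 0.1 are illustrative assumptions rather than anything specified in the abstract.

```python
import math


def log_ratio_reward(logp_policy, logp_ref, beta=0.1):
    """Implicit DPO reward: beta * (log pi(y|x) - log pi_ref(y|x)).

    Inputs are summed token log-probabilities of one response (hypothetical
    values supplied by the caller); beta=0.1 is an illustrative default.
    """
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_policy_chosen, logp_policy_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(r(chosen) - r(rejected))."""
    margin = (log_ratio_reward(logp_policy_chosen, logp_ref_chosen, beta)
              - log_ratio_reward(logp_policy_rejected, logp_ref_rejected, beta))
    # -log sigmoid(m) = log(1 + exp(-m)); branch on the sign for stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for a single preference pair.
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy equals reference
loss_after = dpo_loss(-10.0, -18.0, -12.0, -15.0)    # policy favors the chosen response
```

When the policy equals the reference, both rewards are zero and the loss is log 2; raising the chosen response's log-ratio relative to the rejected one lowers the loss.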
