Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

Abstract

Human visual preferences are inherently multi-dimensional, spanning aesthetics, detail fidelity, and semantic alignment. Existing datasets, however, provide only a single holistic annotation per pair, which introduces severe label noise: an image that excels in some dimensions but falls short in others is simply marked as the winner or the loser. We show theoretically that compressing multi-dimensional preferences into binary labels produces conflicting gradient signals that misguide Diffusion Direct Preference Optimization (Diffusion DPO). To address this, we propose Semi-DPO, a semi-supervised approach that treats consistent pairs as clean labeled data and conflicting pairs as noisy unlabeled data. Our method first trains on a clean subset obtained by consensus-based filtering, then uses the resulting model as an implicit classifier to generate pseudo-labels for the noisy set and refines them iteratively. Experiments show that Semi-DPO achieves state-of-the-art performance and significantly improves alignment with complex human preferences, without requiring additional human annotation or an explicit reward model during training. We will release our code and models at: https://github.com/L-CodingSpace/semi-dpo
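The two data-handling steps described above (consensus-based filtering, then pseudo-labeling with the model as an implicit classifier) can be illustrated with a minimal sketch. This is not the released Semi-DPO code. The names `PreferencePair`, `split_by_consensus`, `implicit_margin`, and `pseudo_label`, the per-dimension `votes` field, and the confidence `threshold` are all assumptions introduced here for illustration. The margin uses the standard DPO implicit reward, beta * (log pi_theta - log pi_ref), which a diffusion model would have to approximate, for example from denoising errors.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class PreferencePair:
    """A labeled (winner, loser) pair with hypothetical per-dimension votes.

    Each vote is +1 if the labeled winner is better on that dimension
    (e.g. aesthetics, detail fidelity, semantic alignment), else -1.
    """
    pair_id: str
    votes: Tuple[int, ...]


def split_by_consensus(
    pairs: Sequence[PreferencePair],
) -> Tuple[List[PreferencePair], List[PreferencePair]]:
    """Assumed filtering rule: a pair is clean only if every dimension
    agrees with the holistic label; all other pairs are treated as noisy."""
    clean: List[PreferencePair] = []
    noisy: List[PreferencePair] = []
    for pair in pairs:
        unanimous = bool(pair.votes) and all(v == 1 for v in pair.votes)
        (clean if unanimous else noisy).append(pair)
    return clean, noisy


def implicit_margin(beta: float, logp_w: float, ref_logp_w: float,
                    logp_l: float, ref_logp_l: float) -> float:
    """Standard DPO implicit reward margin of the winner over the loser."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def pseudo_label(
    noisy: Sequence[PreferencePair],
    margin_fn: Callable[[PreferencePair], float],
    threshold: float,
) -> Dict[str, int]:
    """Return +1 (keep the original label) or -1 (flip it) for each noisy
    pair whose margin magnitude exceeds the threshold; skip the rest."""
    labels: Dict[str, int] = {}
    for pair in noisy:
        margin = margin_fn(pair)
        if abs(margin) > threshold:
            labels[pair.pair_id] = 1 if margin > 0 else -1
    return labels
```

In an iterative scheme of this kind, the model trained on the clean subset would supply `margin_fn`, the confidently pseudo-labeled pairs would join the training set, and the cycle would repeat. The threshold trades coverage of the noisy set against pseudo-label reliability.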
