DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

Abstract

Real-world large language model deployments (e.g., conversational AI systems, code generation assistants) naturally generate abundant implicit user dissatisfaction (DSAT) signals, as users iterate toward better answers through refinements, corrections, and expressed preferences, while explicit satisfaction (SAT) feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile, as they rely on costly human annotations or assume plentiful positive responses. In this paper, we introduce DRIFT (Dissatisfaction-Refined Iterative preFerence Training), which anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. Empirically, DRIFT models trained on the real-world WildFeedback dataset and the synthetic UltraFeedback dataset achieve up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2 win rate over base models, outperforming strong baseline methods such as iterative DPO and SPIN. At larger scales, the improvements are particularly pronounced: 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. Further analysis shows that DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we demonstrate that this design preserves preference margins and avoids gradient degeneration. These results show that DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal. The code and data are available at https://github.com/cacayaya/DRIFT.git.
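
The pairing idea described above (logged DSAT responses as fixed negatives, fresh samples from the current policy as positives) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the training objective, so a DPO-style loss is assumed here, and the names `PreferencePair`, `build_pairs`, `dpo_loss`, and `sample_fn` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # sampled from the current policy (assumed to act as the positive)
    rejected: str  # logged response that drew user dissatisfaction (DSAT)


def build_pairs(
    dsat_log: List[Tuple[str, str]],
    sample_fn: Callable[[str], str],
) -> List[PreferencePair]:
    """Pair each logged (prompt, DSAT response) with a fresh policy sample.

    `sample_fn` is a hypothetical stand-in for generation with the current
    policy; in an iterative setup it would point at the updated model each round.
    """
    return [PreferencePair(p, sample_fn(p), r) for p, r in dsat_log]


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO-style loss for one pair (assumed objective, not confirmed by the source).

    Computes -log(sigmoid(margin)) where the margin is the beta-scaled
    difference of policy-vs-reference log-ratios for chosen and rejected.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    log = [("Fix my sort function", "Here is an unrelated poem.")]
    pairs = build_pairs(log, sample_fn=lambda prompt: f"Draft answer to: {prompt}")
    print(pairs[0])
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

Because the rejected side comes from real user logs while the chosen side is re-sampled each round, the negatives stay fixed and only the positives track the evolving policy, which is the property the abstract attributes to DRIFT.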
