Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment

Abstract

Human feedback plays a pivotal role in aligning large language models (LLMs) with human preferences. However, such feedback is often noisy or inconsistent, which can degrade the quality of reward models and hinder alignment. Various automated data cleaning methods have been proposed to mitigate this issue, but a systematic evaluation of their effectiveness and generalizability is still lacking. To bridge this gap, we introduce PrefCleanBench, the first comprehensive benchmark for evaluating 13 preference data cleaning methods in the context of LLM alignment. PrefCleanBench offers a standardized protocol for assessing cleaning strategies in terms of alignment performance and generalizability across diverse datasets, model architectures, and optimization algorithms. By unifying disparate methods and rigorously comparing them, we uncover key factors that determine the success of data cleaning in alignment tasks. This benchmark lays the groundwork for principled and reproducible approaches to improving LLM alignment through better data quality, and it highlights the crucial but underexplored role of data preprocessing in responsible AI development. We release modular implementations of all methods to catalyze further research: https://github.com/deeplearning-wisc/PrefCleanBench.
