Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment

Abstract

Human feedback plays a pivotal role in aligning large language models (LLMs) with human preferences. However, such feedback is often noisy or inconsistent, which can degrade the quality of reward models and hinder alignment. While various automated data cleaning methods have been proposed to mitigate this issue, a systematic evaluation of their effectiveness and generalizability remains lacking. To bridge this gap, we introduce PrefCleanBench, the first comprehensive benchmark for evaluating 13 preference data cleaning methods in the context of LLM alignment. PrefCleanBench offers a standardized protocol to assess cleaning strategies in terms of alignment performance and generalizability across diverse datasets, model architectures, and optimization algorithms. By unifying disparate methods and rigorously comparing them, we uncover key factors that determine the success of data cleaning in alignment tasks. This benchmark lays the groundwork for principled and reproducible approaches to improving LLM alignment through better data quality, highlighting the crucial but underexplored role of data preprocessing in responsible AI development. We release modular implementations of all methods to catalyze further research: https://github.com/deeplearning-wisc/PrefCleanBench.
