COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values

Abstract

Aligning large language models (LLMs) with human preferences has achieved remarkable success. However, existing Chinese preference datasets are limited by small scale, narrow domain coverage, and a lack of rigorous data validation. Additionally, the reliance on human annotators for instruction and response labeling significantly constrains the scalability of human preference datasets. To address these challenges, we design an LLM-based Chinese preference dataset annotation pipeline that requires no human intervention. Specifically, we crawl and carefully filter 92k high-quality Chinese queries and employ 15 mainstream LLMs to generate and score chosen-rejected response pairs. Based on this pipeline, we introduce COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset comprising 1,009k Chinese preference pairs spanning 6 diverse domains: Chat, Code, Math, Logic, Novel, and Role. Building upon COIG-P, to reduce the overhead of using LLMs for scoring, we train an 8B-sized Chinese Reward Model (CRM) and meticulously construct a Chinese Reward Benchmark (CRBench). Evaluation results based on AlignBench (Liu et al., 2024) show that COIG-P significantly outperforms other Chinese preference datasets, bringing performance improvements ranging from 2% to 12% for the Qwen2/2.5 and Infinity-Instruct-3M-0625 model series. The results on CRBench demonstrate that our CRM has a strong and robust scoring ability. We apply it to filter chosen-rejected response pairs in a test split of COIG-P, and our experiments show that it is comparable to GPT-4o in identifying low-quality samples while maintaining efficiency and cost-effectiveness. Our code and data are released at https://github.com/multimodal-art-projection/COIG-P.
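To make the pairing step concrete, here is a minimal sketch of how scored responses to a single query could be turned into chosen-rejected pairs. The abstract does not specify the pairing rule, so the function name `build_preference_pairs`, the `min_gap` threshold, and the score scale are all hypothetical and chosen for illustration only; the released code may work differently.

```python
from itertools import combinations


def build_preference_pairs(scored_responses, min_gap=2.0):
    """Form chosen-rejected pairs from (response, score) tuples for one query.

    Assumption (not from the paper): two responses form a pair only when
    their scores differ by at least `min_gap`; the higher-scored response
    becomes "chosen" and the lower-scored one "rejected".
    """
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(scored_responses, 2):
        if abs(score_a - score_b) < min_gap:
            continue  # scores too close to express a clear preference
        if score_a > score_b:
            chosen, rejected = resp_a, resp_b
        else:
            chosen, rejected = resp_b, resp_a
        pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs


# Hypothetical scores that judge LLMs might assign to three candidate responses.
scored = [("response A", 9.0), ("response B", 5.0), ("response C", 8.5)]
pairs = build_preference_pairs(scored, min_gap=2.0)
```

With these example scores, A and C are too close to be paired, while each of them is paired against B as the rejected response. A larger `min_gap` would yield fewer but more clearly separated pairs.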
