HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages

Abstract

Preference datasets are essential for training general-domain, instruction-following language models with Reinforcement Learning from Human Feedback (RLHF). Each data release raises expectations for future data collection, so the quality and diversity of openly available preference data must keep advancing. To address this need, we introduce HelpSteer3-Preference, a permissively licensed (CC-BY-4.0), high-quality, human-annotated preference dataset comprising over 40,000 samples. These samples span diverse real-world applications of large language models (LLMs), including tasks relating to STEM, coding, and multilingual scenarios. Using HelpSteer3-Preference, we train Reward Models (RMs) that achieve top performance on RM-Bench (82.4%) and JudgeBench (73.7%). This represents a substantial improvement (~10% absolute) over the previously best-reported results from existing RMs. We also demonstrate that HelpSteer3-Preference can be applied to train Generative RMs, and we show how policy models can be aligned with RLHF using our RMs. Dataset (CC-BY-4.0): https://huggingface.co/datasets/nvidia/HelpSteer3#preference
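To make the link between preference pairs and a reward model concrete, the following is a minimal sketch rather than the training recipe used for the reported results. The abstract does not state the training objective; the Bradley-Terry pairwise loss shown here is assumed only because it is a common choice for this kind of data. The function names are hypothetical, and the accuracy helper is a simplified pairwise metric, not the exact RM-Bench or JudgeBench protocol.

```python
import math
from typing import Iterable, Tuple


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Assumed objective (not stated in the abstract):
    loss = -log(sigmoid(reward_chosen - reward_rejected)).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() is never called on a large positive number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(pairs: Iterable[Tuple[float, float]]) -> float:
    """Fraction of (chosen_reward, rejected_reward) pairs ranked correctly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)
```

Under this assumption, a reward model is trained to lower `bradley_terry_loss` on each annotated pair, and a benchmark-style score can be approximated by calling `pairwise_accuracy` on held-out pairs scored by the trained model.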