Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

Abstract

In this report, we introduce a collection of methods to enhance reward modeling for LLMs, focusing specifically on data-centric techniques. We propose effective data selection and filtering strategies for curating high-quality open-source preference datasets, culminating in the Skywork-Reward data collection, which contains only 80K preference pairs -- significantly smaller than existing datasets. Using this curated dataset, we developed the Skywork-Reward model series -- Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B -- with the former currently holding the top position on the RewardBench leaderboard. Notably, our techniques and datasets have directly enhanced the performance of many top-ranked models on RewardBench, highlighting the practical impact of our contributions in real-world preference learning applications.
