WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
January 30, 2025
Authors: Benjamin Feuer, Chinmay Hegde
cs.AI
Abstract
Language model (LLM) post-training, from DPO to distillation, can refine
behaviors and unlock new skills, but the open science supporting these
post-training techniques is still in its infancy. One limiting factor has been
the difficulty of conducting large-scale comparative analyses of synthetic data
generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M,
the largest public chat dataset to date. We extend the existing WildChat
dataset to include responses not only from GPT, but from over 50 different
open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an
extensive comparative analysis and demonstrate the potential of this dataset by
creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3
SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples
and code are available at https://github.com/penfever/wildchat-50m.
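To make the data-collection idea concrete, here is a minimal sketch of how a single open-weight model could regenerate the assistant turns of an existing chat prompt, which is the core operation WILDCHAT-50M repeats across 50+ models from 0.5B to 104B parameters. This is not the authors' pipeline (that lives in the linked repository); the model ID and the regenerate helper below are illustrative assumptions, using only standard Hugging Face transformers calls.

```python
# Illustrative sketch only: sample a fresh response to a chat prompt from one
# open-weight model. The model ID and helper are assumptions, not paper code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical data-generating model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def regenerate(conversation, max_new_tokens=512):
    """Sample a new assistant response to the given user turns."""
    prompt = tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

print(regenerate([{"role": "user", "content": "Explain DPO in one sentence."}]))
```

Swapping model_id across many different open-weight chat models while holding the prompts fixed is what enables the paper's controlled comparison of data-generating models: the user turns stay constant, so differences in downstream SFT quality can be attributed to the generator.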