OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data
April 18, 2024
Authors: Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, Yasiru Ratnayake
cs.AI
Abstract
Instruction fine-tuning pretrained LLMs for diverse downstream tasks has
demonstrated remarkable success and has captured the interest of both academics
and practitioners. To ensure such fine-tuned LLMs align with human preferences,
techniques such as RLHF and DPO have emerged. At the same time, there is
increasing interest in smaller parameter counts for models. In this work, using
OpenLLaMA 3Bv2 as a base model, we describe the recipe used to fine-tune the
OpenBezoar family of models. In this recipe, we first generate synthetic
instruction fine-tuning data using an open and commercially non-restrictive
instruction fine-tuned variant of the Falcon-40B model under three schemes
based on: LaMini-LM, WizardLM/Evol-Instruct (with databricks-dolly-15k as a
seed dataset) and Orca (with the Flan Collection as a seed dataset), then
filter these generations using GPT-4 as a human proxy. We then perform
cost-effective QLoRA-based supervised fine-tuning sequentially with each
scheme. The resulting checkpoint is further fine-tuned with a subset of the
HH-RLHF dataset to minimize distribution shift prior to using the DPO loss to
obtain the final checkpoint. Evaluation is done with the LM Eval Harness
tasks/metrics as well as on MT-Bench using the "LLM-as-a-judge" framework with
Claude 2.1, with the finding that the final checkpoint,
"OpenBezoar-HH-RLHF-DPO", demonstrates superior performance over many models at
the 3B parameter scale, even outperforming the top model in one of the
categories on the Huggingface Open LLM Leaderboard. We release
"OpenBezoar-SFT", "OpenBezoar-HH-RLHF-SFT", "OpenBezoar-HH-RLHF-DPO"
checkpoints, alongside our generated datasets on HuggingFace at
https://huggingface.co/collections/SurgeGlobal/open-bezoar-6620a24923e12127e9e2b9cc
and our codebase at
https://bitbucket.org/paladinanalytics/workspace/projects/OP.
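
To make the recipe's first step concrete, below is a minimal sketch of filtering synthetic (instruction, response) pairs with GPT-4 acting as a human proxy. This is an illustrative reconstruction, not the authors' code: the rubric prompt, the ACCEPT/REJECT convention and the keep_pair helper are assumptions; only the idea of judging each Falcon-40B generation with GPT-4 comes from the abstract.

```python
# Illustrative GPT-4-as-human-proxy filter for synthetic instruction data.
# The prompt wording and accept/reject rubric are assumptions, not the paper's.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def keep_pair(instruction: str, response: str) -> bool:
    """Return True if GPT-4 judges the generated pair usable for fine-tuning."""
    prompt = (
        "You are reviewing synthetic instruction-tuning data.\n\n"
        f"Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
        "Reply ACCEPT if the response is correct, relevant and well formed; "
        "otherwise reply REJECT."
    )
    out = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return out.choices[0].message.content.strip().upper().startswith("ACCEPT")

# Usage: filtered = [p for p in pairs if keep_pair(p["instruction"], p["response"])]
```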
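The cost-effective QLoRA-based supervised fine-tuning stage could be sketched with the Hugging Face transformers/peft/trl stack as follows. The LoRA hyperparameters, training arguments and the local sft_mix.jsonl file (one of the generated instruction sets rendered to a single "text" field) are placeholders rather than the paper's settings, and the SFTTrainer keyword arguments follow the trl ~0.7-era API, which differs in newer releases.

```python
# Minimal QLoRA SFT sketch on OpenLLaMA 3Bv2 (hyperparameters and dataset are
# illustrative, not the paper's settings).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

base_model = "openlm-research/open_llama_3b_v2"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit NF4 quantisation: the "Q" in QLoRA
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# Placeholder: one of the generated instruction sets, already rendered to a
# single "text" field (prompt template omitted here).
train_data = load_dataset("json", data_files="sft_mix.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    args=TrainingArguments(output_dir="openbezoar-sft", per_device_train_batch_size=4,
                           gradient_accumulation_steps=4, learning_rate=2e-4,
                           num_train_epochs=1, bf16=True),
    train_dataset=train_data,
    peft_config=lora_config,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=1024,
)
trainer.train()
```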
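For the final stage, the DPO loss widens the log-probability margin of chosen over rejected responses relative to a frozen reference model. A sketch using trl's DPOTrainer is below; the SFT checkpoint id, the hyperparameters and the preprocessed preference file (HH-RLHF conversations converted to prompt/chosen/rejected columns) are assumptions, and newer trl releases move beta and length limits into a DPOConfig.

```python
# Illustrative DPO fine-tuning sketch starting from the HH-RLHF SFT checkpoint.
# The repository id and hyperparameters are assumptions, not the paper's settings.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

sft_ckpt = "SurgeGlobal/OpenBezoar-HH-RLHF-SFT"  # assumed repo id from the released collection

tokenizer = AutoTokenizer.from_pretrained(sft_ckpt)
model = AutoModelForCausalLM.from_pretrained(sft_ckpt)        # policy being optimised
ref_model = AutoModelForCausalLM.from_pretrained(sft_ckpt)    # frozen reference policy

# Expects columns "prompt", "chosen", "rejected"; converting raw HH-RLHF
# conversations into that layout is omitted for brevity.
prefs = load_dataset("json", data_files="hh_rlhf_prefs.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=TrainingArguments(output_dir="openbezoar-dpo", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, learning_rate=5e-6,
                           num_train_epochs=1),
    beta=0.1,                     # weight of the implicit KL constraint in the DPO loss
    train_dataset=prefs,
    tokenizer=tokenizer,
)
trainer.train()
```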
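Evaluation with EleutherAI's LM Eval Harness might be run along these lines. The task list is purely illustrative (the abstract does not enumerate tasks here), the checkpoint id is assumed from the released collection, and the call follows the v0.4-style Python entry point, which differs from older harness versions.

```python
# Illustrative LM Eval Harness run against a released checkpoint
# (v0.4-style API; tasks and checkpoint id are examples only).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=SurgeGlobal/OpenBezoar-HH-RLHF-DPO,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc2", "winogrande"],
    batch_size=8,
)
print(results["results"])
```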