OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data
April 18, 2024
Authors: Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, Yasiru Ratnayake
cs.AI
Abstract
Instruction fine-tuning pretrained LLMs for diverse downstream tasks has
demonstrated remarkable success and has captured the interest of both academics
and practitioners. To ensure such fine-tuned LLMs align with human preferences,
techniques such as RLHF and DPO have emerged. At the same time, there is
increasing interest in smaller parameter counts for models. In this work, using
OpenLLaMA 3Bv2 as a base model, we describe the recipe used to fine-tune the
OpenBezoar family of models. In this recipe: We first generate synthetic
instruction fine-tuning data using an open and commercially non-restrictive
instruction fine-tuned variant of the Falcon-40B model under three schemes
based on: LaMini-LM, WizardLM/Evol-Instruct (with databricks-dolly-15k as a
seed dataset) and Orca (with the Flan Collection as a seed dataset), then
filter these generations using GPT-4 as a human proxy. We then perform
cost-effective QLoRA-based supervised fine-tuning sequentially with each
scheme. The resulting checkpoint is further fine-tuned with a subset of the
HH-RLHF dataset to minimize distribution shift prior to using the DPO loss to
obtain the final checkpoint. Evaluation is done with the LM Eval Harness
tasks/metrics as well as on MT-Bench using the "LLM-as-a-judge" framework with
Claude 2.1, with the finding that the final checkpoint,
"OpenBezoar-HH-RLHF-DPO", demonstrates superior performance over many models at
the 3B parameter scale, even outperforming the top model in one of the
categories on the Huggingface Open LLM Leaderboard. We release
"OpenBezoar-SFT", "OpenBezoar-HH-RLHF-SFT", "OpenBezoar-HH-RLHF-DPO"
checkpoints, alongside our generated datasets on HuggingFace at
https://huggingface.co/collections/SurgeGlobal/open-bezoar-6620a24923e12127e9e2b9cc
and our codebase at
https://bitbucket.org/paladinanalytics/workspace/projects/OP.
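The abstract compresses a multi-stage recipe: QLoRA-based supervised fine-tuning on the filtered synthetic instruction data, a further SFT pass on an HH-RLHF subset, and finally DPO. The authors' actual training code lives in the Bitbucket repository linked above; the snippet below is only a rough sketch of what a QLoRA SFT stage followed by DPO can look like with the Hugging Face TRL/PEFT stack. TRL 0.7-era argument names are assumed, and the dataset files, hyperparameters, and output paths are illustrative placeholders rather than the paper's configuration.

```python
# Minimal sketch only: QLoRA-based SFT followed by DPO with the TRL/PEFT stack
# (TRL 0.7-era argument names). Dataset files, hyperparameters, and output paths
# are placeholders, not the authors' actual configuration from the Bitbucket repo.
import torch
from datasets import load_dataset
from peft import AutoPeftModelForCausalLM, LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import DPOTrainer, SFTTrainer

base = "openlm-research/open_llama_3b_v2"  # OpenLLaMA 3Bv2 base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

# QLoRA: load the base model in 4-bit NF4 and train low-rank adapters on top.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16,
                         bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb,
                                             device_map="auto")
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

# Stage 1: supervised fine-tuning on (already filtered) instruction data.
sft_data = load_dataset("json", data_files="synthetic_instructions.jsonl", split="train")
sft_args = TrainingArguments(output_dir="openbezoar-sft", per_device_train_batch_size=4,
                             gradient_accumulation_steps=4, num_train_epochs=1,
                             learning_rate=2e-4, bf16=True, logging_steps=50)
sft_trainer = SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=sft_data,
                         dataset_text_field="text", max_seq_length=1024,
                         peft_config=lora, args=sft_args)
sft_trainer.train()
sft_trainer.save_model("openbezoar-sft")  # saves the LoRA adapter

# Stage 2: reload the SFT checkpoint and run DPO on preference pairs
# (prompt / chosen / rejected columns), e.g. a subset of HH-RLHF.
dpo_model = AutoPeftModelForCausalLM.from_pretrained("openbezoar-sft", is_trainable=True,
                                                     quantization_config=bnb, device_map="auto")
dpo_data = load_dataset("json", data_files="hh_rlhf_pairs.jsonl", split="train")
dpo_args = TrainingArguments(output_dir="openbezoar-dpo", per_device_train_batch_size=2,
                             gradient_accumulation_steps=8, learning_rate=5e-6,
                             bf16=True, logging_steps=50)
DPOTrainer(dpo_model, ref_model=None, beta=0.1, train_dataset=dpo_data,
           tokenizer=tokenizer, args=dpo_args).train()
```

Note that newer TRL releases move most of these arguments into SFTConfig/DPOConfig objects, so the exact keyword names depend on the installed version; the released OpenBezoar checkpoints and datasets linked above are the authoritative artifacts.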