OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data

Abstract

Instruction fine-tuning of pretrained LLMs for diverse downstream tasks has shown remarkable success and has captured the interest of both academics and practitioners. To ensure that such fine-tuned LLMs align with human preferences, techniques such as RLHF and DPO have emerged. At the same time, there is growing interest in models with smaller parameter counts. In this work, we describe the recipe used to fine-tune the OpenBezoar family of models, using OpenLLaMA 3Bv2 as the base model. First, we generate synthetic instruction fine-tuning data with an open, commercially non-restrictive instruction fine-tuned variant of the Falcon-40B model under three schemes, based on LaMini-LM, WizardLM/Evol-Instruct (with databricks-dolly-15k as the seed dataset) and Orca (with the Flan Collection as the seed dataset), and then filter these generations using GPT-4 as a human proxy. Second, we perform cost-effective QLoRA-based supervised fine-tuning sequentially with each scheme. Third, the resulting checkpoint is further fine-tuned on a subset of the HH-RLHF dataset to minimize distribution shift, after which the DPO loss is used to obtain the final checkpoint. We evaluate with the LM Eval Harness tasks/metrics and on MT-Bench using the "LLM-as-a-judge" framework with Claude 2.1. The final checkpoint, "OpenBezoar-HH-RLHF-DPO", outperforms many models at the 3B parameter scale and even outperforms the top model in one of the categories on the Huggingface Open LLM Leaderboard. We release the "OpenBezoar-SFT", "OpenBezoar-HH-RLHF-SFT" and "OpenBezoar-HH-RLHF-DPO" checkpoints, alongside our generated datasets, on HuggingFace at https://huggingface.co/collections/SurgeGlobal/open-bezoar-6620a24923e12127e9e2b9cc and our codebase at https://bitbucket.org/paladinanalytics/workspace/projects/OP.
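
As a rough illustration of the final alignment step, the standard DPO loss for a single preference pair can be sketched as below. This is a minimal sketch of the generally published DPO objective, not the authors' training code: the function names, the argument names and the value `beta=0.1` are assumptions made for illustration, and the inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    beta is a hypothetical default; the abstract does not state the
    value used for OpenBezoar.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled margin by which the policy prefers the chosen response
    # more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy is identical to the reference, the margin is zero and the loss is log 2 (about 0.693); the loss falls below that value as the policy assigns relatively more probability to the chosen response than the reference does.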
