OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data

Abstract

Instruction fine-tuning of pretrained large language models (LLMs) for diverse downstream tasks has demonstrated remarkable success and has captured the interest of both academics and practitioners. To ensure that such fine-tuned LLMs align with human preferences, techniques such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have emerged. At the same time, there is increasing interest in models with smaller parameter counts. In this work, using OpenLLaMA 3Bv2 as a base model, we describe the recipe used to fine-tune the OpenBezoar family of models. In this recipe, we first generate synthetic instruction fine-tuning data using an open and commercially non-restrictive instruction fine-tuned variant of the Falcon-40B model under three schemes based on LaMini-LM, WizardLM/Evol-Instruct (with databricks-dolly-15k as a seed dataset), and Orca (with the Flan Collection as a seed dataset), and then filter these generations using GPT-4 as a human proxy. We then perform cost-effective QLoRA-based supervised fine-tuning sequentially with each scheme. The resulting checkpoint is further fine-tuned with a subset of the HH-RLHF dataset to minimize distribution shift before the DPO loss is applied to obtain the final checkpoint. Evaluation is done with the LM Eval Harness tasks/metrics as well as on MT-Bench using the "LLM-as-a-judge" framework with Claude 2.1. We find that the final checkpoint, "OpenBezoar-HH-RLHF-DPO", outperforms many models at the 3B parameter scale, and even outperforms the top model in one of the categories on the HuggingFace Open LLM Leaderboard. We release the "OpenBezoar-SFT", "OpenBezoar-HH-RLHF-SFT", and "OpenBezoar-HH-RLHF-DPO" checkpoints, alongside our generated datasets, on HuggingFace at https://huggingface.co/collections/SurgeGlobal/open-bezoar-6620a24923e12127e9e2b9cc and our codebase at https://bitbucket.org/paladinanalytics/workspace/projects/OP.
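
To make the final step of the recipe concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair. It assumes the summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model are already available. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions and are not taken from the OpenBezoar codebase.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed token log-probability of a response
    given the prompt, under either the policy or the reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The loss shrinks as the policy favors the chosen response over
    # the rejected one more strongly than the reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

When the policy is identical to the reference model, the margin is zero and the loss equals log 2. In this formulation, starting DPO from a checkpoint already fine-tuned on the preference data's distribution (as the abstract describes for the HH-RLHF subset) means the policy and reference begin from the same point.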
