FuseChat: Knowledge Fusion of Chat Models

Abstract

Training large language models (LLMs) from scratch can produce models with distinct capabilities and strengths, but it incurs substantial costs and may lead to redundancy in competencies. Knowledge fusion aims to integrate existing LLMs of diverse architectures and capabilities into a more potent LLM through lightweight continual training, thereby reducing the need for costly LLM development. In this work, we propose a new framework for the knowledge fusion of chat LLMs through two main stages, resulting in FuseChat. First, we conduct pairwise knowledge fusion on source chat LLMs of varying structures and scales to create multiple target LLMs with identical structure and size via lightweight fine-tuning. During this process, a statistics-based token alignment approach is introduced as the cornerstone for fusing LLMs with different structures. Second, we merge these target LLMs within the parameter space, where we propose a novel method for determining the merging coefficients based on the magnitude of parameter updates before and after fine-tuning. We implement and validate FuseChat using six prominent chat LLMs with diverse architectures and scales: OpenChat-3.5-7B, Starling-LM-7B-alpha, NH2-SOLAR-10.7B, InternLM2-Chat-20B, Mixtral-8x7B-Instruct, and Qwen-1.5-Chat-72B. Experimental results on two instruction-following benchmarks, AlpacaEval 2.0 and MT-Bench, demonstrate the superiority of FuseChat-7B over baselines of various sizes. Our model is even comparable to the larger Mixtral-8x7B-Instruct and approaches GPT-3.5-Turbo-1106 on MT-Bench. Our code, model weights, and data are publicly available at https://github.com/fanqiwan/FuseAI.
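To make the second stage more concrete, here is a minimal sketch of merging several fine-tuned target models in parameter space, with coefficients derived from how much each model's parameters moved during fine-tuning. It is an illustration under stated assumptions, not the authors' implementation: the function name `merge_by_update_magnitude`, the use of the mean squared update per parameter matrix as the "magnitude", and the uniform fallback for unchanged matrices are all hypothetical choices made for this example. Models are represented as plain dictionaries of NumPy arrays.

```python
import numpy as np


def merge_by_update_magnitude(pivot, targets, eps=1e-12):
    """Merge fine-tuned models that share one architecture.

    pivot   -- dict: parameter name -> array, weights before fine-tuning
    targets -- list of dicts with the same keys, weights after fine-tuning
    Assumption: a model's coefficient for a matrix is its mean squared
    update on that matrix, normalized across all target models.
    """
    merged = {}
    for name, base in pivot.items():
        # Magnitude of each model's update to this parameter matrix.
        magnitudes = np.array(
            [np.mean((t[name] - base) ** 2) for t in targets]
        )
        total = magnitudes.sum()
        if total < eps:
            # No model changed this matrix: fall back to a plain average.
            coeffs = np.full(len(targets), 1.0 / len(targets))
        else:
            coeffs = magnitudes / total
        # Weighted sum of the target matrices.
        merged[name] = sum(c * t[name] for c, t in zip(coeffs, targets))
    return merged


if __name__ == "__main__":
    # Toy example: two "models", each with a single 2x2 weight matrix.
    pivot = {"w": np.zeros((2, 2))}
    targets = [{"w": np.ones((2, 2))}, {"w": 3.0 * np.ones((2, 2))}]
    print(merge_by_update_magnitude(pivot, targets)["w"])
```

In the toy example the second model moved further from the pivot, so it receives the larger coefficient. Computing coefficients per matrix rather than per model lets different layers draw on different target models.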
