Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs

Abstract

Using in-context learning (ICL) for data generation, techniques such as Self-Instruct (Wang et al., 2023) or the follow-up Alpaca (Taori et al., 2023) can train strong conversational agents with only a small amount of human supervision. One limitation of these approaches is that they rely on very large language models (around 175B parameters) that are also proprietary and non-public. Here we explore the application of such techniques to language models that are much smaller (around 10B to 40B parameters) and have permissive licenses. We find the Self-Instruct approach to be less effective at these sizes and propose new ICL methods that draw on two main ideas: (a) categorization and simplification of the ICL templates to make prompt learning easier for the LM, and (b) ensembling over multiple LM outputs to help select high-quality synthetic examples. Our algorithm leverages the 175 Self-Instruct seed tasks and employs separate pipelines for instructions that require an input and instructions that do not. Empirical investigations with different LMs show that: (1) our proposed method yields higher-quality instruction-tuning data than Self-Instruct, (2) it improves the performance of both vanilla and instruction-tuned LMs by significant margins, and (3) smaller instruction-tuned LMs generate more useful outputs than their larger untuned counterparts. Our codebase is available at https://github.com/IBM/ensemble-instruct.
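Idea (b) above, ensembling over multiple LM outputs, can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not state the similarity measure or the acceptance threshold, so the token-overlap F1 score, the 0.5 threshold, and the function names token_f1 and select_by_consensus are all assumptions made for illustration. The sketch keeps the candidate output that agrees most with the outputs of the other LMs and discards the example when agreement is weak.

```python
from collections import Counter
from typing import Optional


def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two strings (an assumed stand-in similarity)."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    # Multiset intersection counts shared tokens with multiplicity.
    overlap = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(tokens_a)
    recall = overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def select_by_consensus(candidates: list[str],
                        threshold: float = 0.5) -> Optional[str]:
    """Return the candidate most similar to the others, or None if agreement is weak.

    `candidates` holds one output per LM for the same instruction.
    The threshold value is hypothetical, not taken from the paper.
    """
    if len(candidates) < 2:
        return None  # no ensemble to agree with
    best, best_score = None, -1.0
    for i, candidate in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        # Average similarity of this candidate to every other LM's output.
        score = sum(token_f1(candidate, other) for other in others) / len(others)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


if __name__ == "__main__":
    outputs = [
        "Paris is the capital of France",
        "The capital of France is Paris",
        "I do not know",
    ]
    print(select_by_consensus(outputs, threshold=0.4))
```

Here two of the three hypothetical LM outputs agree, so one of them is kept; if all three disagreed, the function would return None and the synthetic example would be dropped.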
