Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models
October 20, 2023
Authors: Ruida Wang, Wangchunshu Zhou, Mrinmaya Sachan
cs.AI
Abstract
*Data Synthesis* is a promising way to train a small model with very little
labeled data. One approach for data synthesis is to leverage the rich knowledge
from large language models to synthesize pseudo training examples for small
models, making it possible to achieve both data and compute efficiency at the
same time. However, a key challenge in data synthesis is that the synthesized
dataset often suffers from a large distributional discrepancy from the *real
task* data distribution. Thus, in this paper, we propose *Synthesis Step by
Step* (**S3**), a data synthesis framework that shrinks this distribution gap
iteratively: a small model is trained on the synthesized dataset and evaluated
on a small real-world validation dataset, and a large language model then
extrapolates the errors the small model makes there. Extensive experiments on
multiple NLP tasks show that our
approach improves the performance of a small model by reducing the gap between
the synthetic dataset and the real data, resulting in significant improvements
over several baselines: 9.48% over ZeroGen, 2.73% over GoldGen, and up to
15.17% over the small model trained on human-annotated data.
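
The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `KeywordClassifier` is a toy stand-in for the small model, `stub_extrapolate` is a hypothetical placeholder for the large language model call, and `iterative_synthesis` and its `rounds` parameter are names invented here for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)


class KeywordClassifier:
    """Toy stand-in for the small model: votes using per-word label counts."""

    def __init__(self, default_label: str) -> None:
        self.default_label = default_label
        self.word_counts: Dict[str, Counter] = {}

    def fit(self, data: List[Example]) -> None:
        # Retrain from scratch on the current training set.
        self.word_counts = {}
        for text, label in data:
            for word in text.lower().split():
                self.word_counts.setdefault(word, Counter())[label] += 1

    def predict(self, text: str) -> str:
        votes: Counter = Counter()
        for word in text.lower().split():
            votes.update(self.word_counts.get(word, {}))
        # Fall back to the default label when no known word appears.
        return votes.most_common(1)[0][0] if votes else self.default_label


def stub_extrapolate(errors: List[Example]) -> List[Example]:
    """Hypothetical placeholder for the LLM: a real system would prompt a
    large language model to write new examples resembling each error."""
    return [(f"a {text} overall", label) for text, label in errors]


def iterative_synthesis(
    seed_data: List[Example],
    validation: List[Example],
    extrapolate: Callable[[List[Example]], List[Example]],
    rounds: int = 3,
) -> Tuple[KeywordClassifier, List[Example]]:
    """Train, collect validation errors, extrapolate them, and repeat."""
    train = list(seed_data)
    model = KeywordClassifier(default_label=seed_data[0][1])
    for _ in range(rounds):
        model.fit(train)
        # Errors of the small model on the small real-world validation set.
        errors = [(t, y) for t, y in validation if model.predict(t) != y]
        if not errors:
            break
        # Grow the synthetic training set with examples derived from errors.
        train.extend(extrapolate(errors))
    model.fit(train)  # final fit on everything collected so far
    return model, train
```

In this sketch the seed dataset plays the role of the initial synthesized data, and each round adds only examples derived from what the small model currently gets wrong on the validation set. The stopping rule (no remaining validation errors or a fixed round budget) is an assumption made for the example.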