Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models
October 20, 2023
Authors: Ruida Wang, Wangchunshu Zhou, Mrinmaya Sachan
cs.AI
Abstract
*Data Synthesis* is a promising way to train a small model with very little
labeled data. One approach for data synthesis is to leverage the rich knowledge
from large language models to synthesize pseudo training examples for small
models, making it possible to achieve both data and compute efficiency at the
same time. However, a key challenge in data synthesis is that the synthesized
dataset often diverges substantially from the *real task* data distribution.
Thus, in this paper, we propose *Synthesis Step by Step* (**S3**), a data
synthesis framework that shrinks this distribution gap by using a large
language model to iteratively extrapolate the errors that a small model,
trained on the synthesized dataset, makes on a small real-world validation
dataset. Extensive experiments on multiple NLP tasks show that our
approach improves the performance of a small model by reducing the gap between
the synthetic dataset and the real data, yielding significant improvements
over several baselines: 9.48% over ZeroGen, 2.73% over GoldGen, and up to
15.17% over the small model trained on human-annotated data.
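
The loop the abstract describes (train a small model on synthetic data, collect its errors on a small real validation set, synthesize more data around those errors, repeat) can be sketched as a toy program. This is a minimal illustration under stated assumptions, not the authors' implementation: the word-voting classifier stands in for the small model, and `extrapolate_errors` is a hypothetical placeholder for the LLM call that only templates new text around each error so the script runs offline. All function names and the example data are invented for this sketch.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)


def train_small_model(data: List[Example]) -> Callable[[str], str]:
    """Toy stand-in for the small model: each known word votes for its labels."""
    word_votes: Dict[str, Counter] = defaultdict(Counter)
    label_counts = Counter(label for _, label in data)
    for text, label in data:
        for word in text.lower().split():
            word_votes[word][label] += 1
    default_label = label_counts.most_common(1)[0][0]

    def predict(text: str) -> str:
        votes: Counter = Counter()
        for word in text.lower().split():
            votes.update(word_votes.get(word, {}))
        # Fall back to the most frequent training label for unseen words.
        return votes.most_common(1)[0][0] if votes else default_label

    return predict


def misclassified(predict: Callable[[str], str],
                  validation: List[Example]) -> List[Example]:
    """Return the validation examples the current model gets wrong."""
    return [(text, label) for text, label in validation if predict(text) != label]


def extrapolate_errors(errors: List[Example]) -> List[Example]:
    """Hypothetical placeholder for the LLM step.

    A real system would prompt a large language model to write new examples
    resembling each error; here we only append a word to the error text.
    """
    return [(f"{text} indeed", label) for text, label in errors]


def iterative_synthesis(seed: List[Example], validation: List[Example],
                        rounds: int = 3):
    """Grow the synthetic dataset from validation errors for a few rounds."""
    data = list(seed)
    for _ in range(rounds):
        predict = train_small_model(data)
        errors = misclassified(predict, validation)
        if not errors:
            break  # nothing left to extrapolate from
        data.extend(extrapolate_errors(errors))
    return train_small_model(data), data


# Invented seed "synthetic" data and a small "real" validation set.
seed = [("great movie", "pos"), ("terrible movie", "neg")]
validation = [
    ("great fun", "pos"),
    ("dull plot", "neg"),
    ("boring and terrible", "neg"),
    ("charming story", "pos"),
]

model, data = iterative_synthesis(seed, validation)
print("errors with seed data only:",
      len(misclassified(train_small_model(seed), validation)))
print("errors after iterative synthesis:", len(misclassified(model, validation)))
print("final synthetic dataset size:", len(data))
```

In this toy setting the validation set doubles as the measure of success, which overstates the gain; a real evaluation would report results on a separate held-out test set.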