Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models

Abstract

*Data synthesis* is a promising way to train a small model with very little labeled data. One approach is to leverage the rich knowledge in large language models to synthesize pseudo training examples for small models, which makes it possible to achieve data efficiency and compute efficiency at the same time. A key challenge, however, is that the synthesized dataset often suffers from a large discrepancy from the *real task* data distribution. In this paper, we propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap by using a large language model to iteratively extrapolate the errors that a small model trained on the synthesized dataset makes on a small real-world validation dataset. Extensive experiments on multiple NLP tasks show that our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data, yielding significant improvements over several baselines: 9.48% over ZeroGen, 2.73% over GoldGen, and up to 15.17% over the small model trained on human-annotated data.
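The iterative loop described in the abstract can be illustrated with a small sketch. This is only a toy under stated assumptions, not the authors' implementation: the function names (`synthesize_step_by_step`, `toy_train`, `toy_extrapolate_errors`), the word-count classifier standing in for the small model, and the string-template function standing in for the large language model are all hypothetical.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]      # (text, label)
Model = Callable[[str], int]   # maps a text to a predicted label


def synthesize_step_by_step(
    seed_data: List[Example],
    validation: List[Example],
    train: Callable[[List[Example]], Model],
    extrapolate_errors: Callable[[List[Example]], List[Example]],
    max_rounds: int = 3,
) -> Tuple[List[Example], Model]:
    """Grow a synthetic dataset by repeatedly extrapolating validation errors."""
    synthetic = list(seed_data)
    model = train(synthetic)
    for _ in range(max_rounds):
        # Collect the validation examples the current small model gets wrong.
        errors = [(x, y) for x, y in validation if model(x) != y]
        if not errors:
            break
        # Ask the generator (an LLM in the paper, a stub here) for new
        # examples resembling the errors, then retrain on the enlarged set.
        synthetic.extend(extrapolate_errors(errors))
        model = train(synthetic)
    return synthetic, model


def toy_train(data: List[Example]) -> Model:
    """Hypothetical small model: counts words seen with each label."""
    positive, negative = set(), set()
    for text, label in data:
        (positive if label == 1 else negative).update(text.split())

    def predict(text: str) -> int:
        words = text.split()
        pos = sum(w in positive for w in words)
        neg = sum(w in negative for w in words)
        return 1 if pos > neg else 0

    return predict


def toy_extrapolate_errors(errors: List[Example]) -> List[Example]:
    """Hypothetical stand-in for the LLM: emits one variant per error."""
    return [(f"truly {text}", label) for text, label in errors]


seed = [("great movie", 1), ("bad movie", 0)]
validation = [("great film", 1), ("awful film", 0), ("wonderful plot", 1)]
synthetic, model = synthesize_step_by_step(
    seed, validation, toy_train, toy_extrapolate_errors
)
print(len(synthetic), [model(x) for x, _ in validation])
```

In this toy run the only initial error is "wonderful plot", so a single round adds one synthetic example and the retrained model then agrees with all three validation labels. A real setup would replace `toy_extrapolate_errors` with a prompt to a large language model and `toy_train` with fine-tuning of the small model.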
