Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing
May 26, 2023
Authors: Jaehun Jung, Peter West, Liwei Jiang, Faeze Brahman, Ximing Lu, Jillian Fisher, Taylor Sorensen, Yejin Choi
cs.AI
Abstract
It is commonly perceived that the strongest language models (LMs) rely on a
combination of massive scale, instruction data, and human feedback to perform
specialized tasks -- e.g. summarization and paraphrasing, without supervision.
In this paper, we propose that language models can learn to summarize and
paraphrase sentences, with none of these 3 factors. We present Impossible
Distillation, a framework that distills a task-specific dataset directly from
an off-the-shelf LM, even when it is impossible for the LM itself to reliably
solve the task. By training a student model on the generated dataset and
amplifying its capability through self-distillation, our method yields a
high-quality model and dataset from a low-quality teacher model, without the
need for scale or supervision. Using Impossible Distillation, we are able to
distill an order of magnitude smaller model (with only 770M parameters) that
outperforms 175B parameter GPT-3, in both quality and controllability, as
confirmed by automatic and human evaluations. Furthermore, as a useful
byproduct of our approach, we obtain DIMSUM+, a high-quality dataset with 3.4M
sentence summaries and paraphrases. Our analyses show that this dataset, as a
purely LM-generated corpus, is more diverse and more effective for
generalization to unseen domains than all human-authored datasets -- including
Gigaword with 4M samples.
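
The abstract outlines a generate-and-filter recipe: sample candidate (sentence, summary/paraphrase) pairs from an off-the-shelf LM, keep only pairs that pass task-specific quality filters, fine-tune a student model on the resulting dataset, and repeat the loop with the student (self-distillation). Below is a minimal sketch of one round of that loop; the teacher model (gpt2), the NLI-based faithfulness filter, the pairing heuristic, and all thresholds are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch of a generate-and-filter distillation round.
# Assumptions (not from the paper): GPT-2 as the off-the-shelf teacher,
# roberta-large-mnli as a faithfulness filter, naive consecutive-sentence
# pairing, and ad-hoc thresholds.
from transformers import pipeline

teacher = pipeline("text-generation", model="gpt2")
nli = pipeline("text-classification", model="roberta-large-mnli")


def entails(premise: str, hypothesis: str, threshold: float = 0.9) -> bool:
    """True if the NLI filter predicts ENTAILMENT with high confidence."""
    pred = nli({"text": premise, "text_pair": hypothesis})
    if isinstance(pred, list):
        pred = pred[0]
    return pred["label"] == "ENTAILMENT" and pred["score"] >= threshold


def candidate_pairs(prompt: str, n: int = 8) -> list[tuple[str, str]]:
    """Sample continuations from the teacher and pair consecutive sentences
    as (source, candidate rewrite). Naive pairing, for illustration only."""
    outs = teacher(prompt, do_sample=True, top_p=0.9, max_new_tokens=60,
                   num_return_sequences=n, return_full_text=False)
    pairs = []
    for out in outs:
        sents = [s.strip() for s in out["generated_text"].split(". ") if s.strip()]
        if len(sents) >= 2:
            pairs.append((sents[0], sents[1]))
    return pairs


def filter_pairs(pairs, max_compression: float = 0.8):
    """Keep pairs whose meaning is preserved (mutual entailment) and, for
    summarization, whose target is noticeably shorter than the source."""
    kept = []
    for src, tgt in pairs:
        if len(tgt.split()) > max_compression * len(src.split()):
            continue
        if entails(src, tgt) and entails(tgt, src):
            kept.append({"source": src, "summary": tgt})
    return kept


# One round of the loop over a single prompt.
print(filter_pairs(candidate_pairs("The city council met on Tuesday to")))
```

In the setting described above, pairs harvested over many prompts accumulate into the distilled dataset (DIMSUM+ in the paper), a student model is fine-tuned on it, and the same generate-and-filter loop is repeated with the student as the generator to amplify its capability.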