Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing

May 26, 2023
Authors: Jaehun Jung, Peter West, Liwei Jiang, Faeze Brahman, Ximing Lu, Jillian Fisher, Taylor Sorensen, Yejin Choi
cs.AI

Abstract

It is commonly perceived that the strongest language models (LMs) rely on a combination of massive scale, instruction data, and human feedback to perform specialized tasks such as summarization and paraphrasing without supervision. In this paper, we propose that language models can learn to summarize and paraphrase sentences with none of these three factors. We present Impossible Distillation, a framework that distills a task-specific dataset directly from an off-the-shelf LM, even when it is impossible for the LM itself to reliably solve the task. By training a student model on the generated dataset and amplifying its capability through self-distillation, our method yields a high-quality model and dataset from a low-quality teacher model, without the need for scale or supervision. Using Impossible Distillation, we are able to distill a model an order of magnitude smaller (with only 770M parameters) that outperforms the 175B-parameter GPT-3 in both quality and controllability, as confirmed by automatic and human evaluations. Furthermore, as a useful byproduct of our approach, we obtain DIMSUM+, a high-quality dataset with 3.4M sentence summaries and paraphrases. Our analyses show that this dataset, as a purely LM-generated corpus, is more diverse and more effective for generalization to unseen domains than all human-authored datasets, including Gigaword with 4M samples.
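
The abstract describes a three-stage pipeline: over-generate candidate sentence pairs from an off-the-shelf LM, filter them into a task-specific dataset, then train a student model and amplify it through self-distillation. The sketch below is a minimal, hedged illustration of that loop only; the function names, the sampling scheme, and the length/overlap filter are illustrative assumptions, not the authors' actual prompts or filtering criteria.

```python
"""Minimal sketch of the distillation loop described in the abstract.

Assumptions: `lm_sample` and `train_student` are caller-supplied placeholders,
and `keep_pair` is a stand-in filter, not the paper's filtering procedure.
"""

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Pair:
    source: str   # original sentence
    target: str   # candidate summary or paraphrase


def generate_candidate_pairs(lm_sample: Callable[[str], str],
                             contexts: Iterable[str],
                             n_per_context: int = 8) -> list[Pair]:
    """Stage 1: over-generate (source, target) candidates from an
    off-the-shelf LM, even though the LM cannot solve the task reliably."""
    pairs = []
    for ctx in contexts:
        for _ in range(n_per_context):
            source = lm_sample(ctx)                  # sample a sentence in context
            target = lm_sample(ctx + " " + source)   # sample a related sentence
            pairs.append(Pair(source, target))
    return pairs


def keep_pair(p: Pair) -> bool:
    """Stage 2: keep only candidates that look like valid summaries/paraphrases.
    This length-and-overlap check is an assumed stand-in for the real filters."""
    shorter = len(p.target.split()) < len(p.source.split())
    overlap = len(set(p.source.lower().split()) & set(p.target.lower().split()))
    return shorter and overlap >= 3


def impossible_distillation_sketch(lm_sample: Callable[[str], str],
                                   contexts: Iterable[str],
                                   train_student: Callable[[list[Pair]], Callable[[str], str]],
                                   rounds: int = 2):
    """Stage 3: train a student on the filtered pairs, then amplify it by
    self-distillation, i.e. the student regenerates data that is filtered
    again and used to retrain it."""
    data = [p for p in generate_candidate_pairs(lm_sample, contexts) if keep_pair(p)]
    student = train_student(data)
    for _ in range(rounds):
        regenerated = [Pair(p.source, student(p.source)) for p in data]
        data = [p for p in regenerated if keep_pair(p)]
        student = train_student(data)
    return student, data
```

In practice, `lm_sample` would wrap the off-the-shelf teacher LM and `train_student` would fine-tune a small model (the paper reports a 770M-parameter student) on the filtered pairs; both are left abstract here so the sketch stays independent of any particular model or library.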