

Co-training and Co-distillation for Quality Improvement and Compression of Language Models

November 6, 2023
Authors: Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Hongbo Zhang, Sung Ju Hwang, Alexander Min
cs.AI

Abstract

Knowledge Distillation (KD) compresses computationally expensive pre-trained language models (PLMs) by transferring their knowledge to smaller models, allowing their use in resource-constrained or real-time settings. However, most smaller models fail to surpass the performance of the original larger model, resulting in sacrificing performance to improve inference speed. To address this issue, we propose Co-Training and Co-Distillation (CTCD), a novel framework that improves performance and inference speed together by co-training two models while mutually distilling knowledge. The CTCD framework successfully achieves this based on two significant findings: 1) Distilling knowledge from the smaller model to the larger model during co-training improves the performance of the larger model. 2) The enhanced performance of the larger model further boosts the performance of the smaller model. The CTCD framework shows promise as it can be combined with existing techniques like architecture design or data augmentation, replacing one-way KD methods, to achieve further performance improvement. Extensive ablation studies demonstrate the effectiveness of CTCD, and the small model distilled by CTCD outperforms the original larger model by a significant margin of 1.66 on the GLUE benchmark.
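To make the two-way distillation concrete, below is a minimal PyTorch-style sketch of one co-training step in which a larger and a smaller model each learn from the task labels and from the other model's softened predictions. The function name ctcd_step, the loss weighting alpha, and the temperature scaling are illustrative assumptions, not the paper's exact objective.

```python
# Minimal sketch of mutual (two-way) knowledge distillation during co-training,
# illustrating the CTCD idea. Loss weights, temperature, and model interfaces
# are assumptions for illustration, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def ctcd_step(large_model, small_model, optimizer, batch, labels,
              temperature=2.0, alpha=0.5):
    """One co-training step: both models learn from the task labels and
    from each other's softened predictions (bidirectional KD)."""
    logits_l = large_model(batch)   # (batch_size, num_classes)
    logits_s = small_model(batch)

    # Supervised task loss for both models.
    task_loss = F.cross_entropy(logits_l, labels) + F.cross_entropy(logits_s, labels)

    # Temperature-softened log-distributions for distillation.
    log_p_l = F.log_softmax(logits_l / temperature, dim=-1)
    log_p_s = F.log_softmax(logits_s / temperature, dim=-1)

    # Two-way KD: the small model learns from the large model AND
    # the large model learns from the small model.
    kd_small = F.kl_div(log_p_s, log_p_l.exp().detach(), reduction="batchmean")
    kd_large = F.kl_div(log_p_l, log_p_s.exp().detach(), reduction="batchmean")
    kd_loss = (temperature ** 2) * (kd_small + kd_large)

    loss = (1 - alpha) * task_loss + alpha * kd_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch each model's distillation target is detached, so within a single step each network treats the other's output as a fixed teacher signal; one-way KD is recovered by dropping the kd_large term.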