

Co-training and Co-distillation for Quality Improvement and Compression of Language Models

November 6, 2023
作者: Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Hongbo Zhang, Sung Ju Hwang, Alexander Min
cs.AI

Abstract

Knowledge Distillation (KD) compresses computationally expensive pre-trained language models (PLMs) by transferring their knowledge to smaller models, allowing their use in resource-constrained or real-time settings. However, most smaller models fail to surpass the performance of the original larger model, so performance is sacrificed for inference speed. To address this issue, we propose Co-Training and Co-Distillation (CTCD), a novel framework that improves performance and inference speed together by co-training two models while mutually distilling knowledge between them. The CTCD framework achieves this based on two significant findings: 1) distilling knowledge from the smaller model to the larger model during co-training improves the performance of the larger model, and 2) the enhanced performance of the larger model further boosts the performance of the smaller model. The CTCD framework shows promise as it can be combined with existing techniques such as architecture design or data augmentation, replacing one-way KD methods, to achieve further performance improvement. Extensive ablation studies demonstrate the effectiveness of CTCD, and the small model distilled by CTCD outperforms the original larger model by a significant margin of 1.66 points on the GLUE benchmark.
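
To make the two-way distillation described above concrete, the following is a minimal PyTorch-style sketch of one co-training step in which each model learns both from the labels and from the other model's softened predictions. The function name `ctcd_step`, the temperature, the loss weight `alpha`, and the use of a single optimizer over both models' parameters are illustrative assumptions rather than the authors' exact training recipe.

```python
# Sketch of co-training with mutual (two-way) knowledge distillation,
# assuming two classifiers (a larger and a smaller one) trained on the
# same batches. Hyperparameters here are placeholders, not the paper's.
import torch
import torch.nn.functional as F

def ctcd_step(large_model, small_model, optimizer, inputs, labels,
              temperature=2.0, alpha=0.5):
    """One co-training step: each model is trained on the task labels and
    on the other model's temperature-softened predictions (bidirectional KD).
    The optimizer is assumed to hold the parameters of both models."""
    large_logits = large_model(inputs)
    small_logits = small_model(inputs)

    # Supervised task losses for both models.
    ce_large = F.cross_entropy(large_logits, labels)
    ce_small = F.cross_entropy(small_logits, labels)

    # Mutual distillation: KL divergence to the other model's detached,
    # temperature-softened distribution.
    def kd(student_logits, teacher_logits):
        return F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits.detach() / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2

    kd_large = kd(large_logits, small_logits)  # small -> large
    kd_small = kd(small_logits, large_logits)  # large -> small

    loss = (1 - alpha) * (ce_large + ce_small) + alpha * (kd_large + kd_small)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key difference from one-way KD is the `kd_large` term: the larger model also receives a distillation signal from the smaller model, which is the first of the two findings the abstract highlights.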