Co-training and Co-distillation for Quality Improvement and Compression of Language Models

Abstract

Knowledge Distillation (KD) compresses computationally expensive pre-trained language models (PLMs) by transferring their knowledge to smaller models, allowing their use in resource-constrained or real-time settings. However, most smaller models fail to match, let alone surpass, the performance of the original larger model, so faster inference comes at the cost of performance. To address this issue, we propose Co-Training and Co-Distillation (CTCD), a novel framework that improves performance and inference speed together by co-training two models while they distill knowledge into each other. The CTCD framework rests on two findings: 1) Distilling knowledge from the smaller model to the larger model during co-training improves the performance of the larger model. 2) The enhanced performance of the larger model further boosts the performance of the smaller model. Because CTCD replaces one-way KD, it can be combined with existing techniques such as architecture design or data augmentation to achieve further performance improvements. Extensive ablation studies demonstrate the effectiveness of CTCD, and the small model distilled by CTCD outperforms the original larger model by a significant margin of 1.66 on the GLUE benchmark.
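
The mutual distillation described above can be illustrated with a minimal sketch. The abstract does not give the loss, so the form below is an assumption: each model is trained on the hard label plus a KL-divergence term toward the other model's softened prediction. The function names and the hyperparameters alpha and temperature are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Hard-label loss: negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def co_distillation_losses(large_logits, small_logits, label,
                           alpha=0.5, temperature=2.0):
    """Return (large_loss, small_loss) for a single example.

    Assumed form, not the paper's exact objective: hard-label
    cross-entropy plus alpha times a KL term toward the other model.
    In a real training loop the "teacher" side of each KL term would
    be treated as a constant (no gradient flows through it).
    """
    p_large = softmax(large_logits, temperature)
    p_small = softmax(small_logits, temperature)
    # Small -> large: the smaller model acts as teacher for the larger one.
    large_loss = (cross_entropy(large_logits, label)
                  + alpha * kl_divergence(p_small, p_large))
    # Large -> small: the conventional teacher-to-student direction.
    small_loss = (cross_entropy(small_logits, label)
                  + alpha * kl_divergence(p_large, p_small))
    return large_loss, small_loss

# Hypothetical logits for a 3-class example whose true class is 0.
large_loss, small_loss = co_distillation_losses(
    [2.0, 0.5, -1.0], [1.0, 0.8, -0.2], label=0)
```

Setting alpha to 0 for the larger model reduces this sketch to ordinary one-way KD, which is the setup the abstract says CTCD replaces.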
