TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

Abstract

Causal language models have demonstrated remarkable capabilities, but their size poses significant challenges for deployment in resource-constrained environments. Knowledge distillation, a widely used technique for transferring knowledge from a large teacher model to a small student model, is a promising approach to model compression. A significant remaining issue lies in the differences between teacher and student models: the substantial capacity gap, mode averaging, and mode collapse all pose barriers during distillation. To address these issues, we introduce Temporally Adaptive Interpolated Distillation (TAID), a novel knowledge distillation approach that dynamically interpolates between the student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student's initial distribution towards the teacher's distribution. We provide a theoretical analysis demonstrating TAID's ability to prevent mode collapse, and we show empirically that it addresses the capacity gap while balancing mode averaging and mode collapse. Our comprehensive experiments demonstrate TAID's superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. Furthermore, we showcase TAID's practical impact by developing two state-of-the-art compact foundation models: TAID-LLM-1.5B for language tasks and TAID-VLM-2B for vision-language tasks. These results demonstrate TAID's effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.
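To make the idea of an intermediate distribution concrete, the following toy sketch mixes a student and a teacher next-token distribution with a coefficient t that moves from 0 (pure student) to 1 (pure teacher). It is only an illustration under stated assumptions: the abstract does not give the interpolation formula or the adaptive update rule, so the probability-space mixture, the KL objective, and the linear schedule `linear_schedule` used here are hypothetical stand-ins, not the method as specified by the authors.

```python
import math


def interpolate(student_probs, teacher_probs, t):
    """Mix two distributions: t=0 gives the student, t=1 gives the teacher.

    Assumption: a simple probability-space mixture, chosen for illustration.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * s + t * p for s, p in zip(student_probs, teacher_probs)]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(
        a * math.log(a / max(b, eps))
        for a, b in zip(target, predicted)
        if a > 0.0
    )


def linear_schedule(step, total_steps, t_start=0.0, t_end=1.0):
    """Hypothetical stand-in for the adaptive schedule: t grows linearly."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + frac * (t_end - t_start)


student = [0.70, 0.20, 0.10]  # toy student next-token distribution
teacher = [0.10, 0.30, 0.60]  # toy teacher next-token distribution

for step in (0, 5, 10):
    t = linear_schedule(step, total_steps=10)
    target = interpolate(student, teacher, t)
    # In training, the student would be updated to reduce this loss, with the
    # student term inside `target` treated as a constant (no gradient).
    loss = kl_divergence(target, student)
    print(f"step={step:2d} t={t:.2f} loss={loss:.4f}")
```

Early in training the target stays close to the student's own distribution, so the loss is small and the target is easy to match; as t grows, the target approaches the teacher. This is one way to read the abstract's claim that a gradual shift helps bridge the capacity gap.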
