Effective Distillation to Hybrid xLSTM Architectures

Abstract

There have been numerous attempts to distill large language models (LLMs) based on quadratic attention into sub-quadratic, linearized architectures. Despite extensive research, however, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, in which individually linearized experts are combined into a single model. We demonstrate the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.
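To make the notion of a tolerance-corrected Win-and-Tie rate concrete, the following is a minimal sketch under stated assumptions, not the definition used in the paper. It assumes that higher scores are better and that the student "wins or ties" on a task when its score is no more than `tolerance` below the teacher's. The function name `win_and_tie_rate`, the task names, and all scores are hypothetical and serve only as illustration.

```python
from typing import Mapping


def win_and_tie_rate(
    student: Mapping[str, float],
    teacher: Mapping[str, float],
    tolerance: float = 0.0,
) -> float:
    """Fraction of shared tasks on which the student wins or ties.

    Assumption (not taken from the paper): a task counts as a win or tie
    when student_score >= teacher_score - tolerance, with higher scores
    being better.
    """
    tasks = sorted(student.keys() & teacher.keys())
    if not tasks:
        raise ValueError("student and teacher share no tasks")
    hits = sum(1 for t in tasks if student[t] >= teacher[t] - tolerance)
    return hits / len(tasks)


# Made-up scores, purely for illustration.
teacher_scores = {"task_a": 0.70, "task_b": 0.55, "task_c": 0.80}
student_scores = {"task_a": 0.72, "task_b": 0.54, "task_c": 0.70}

# With a tolerance of 0.02 the student wins task_a, ties task_b,
# and loses task_c.
rate = win_and_tie_rate(student_scores, teacher_scores, tolerance=0.02)
```

Under this reading, a distillation would count as lossless on a task set when the rate reaches 1.0 at the chosen tolerance. Setting `tolerance=0.0` gives the stricter comparison in which only exact ties and outright wins count.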
