Effective Distillation to Hybrid xLSTM Architectures

Abstract

Many attempts have been made to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. Despite extensive research, however, such distilled models often fail to match the performance of their teacher LLMs on a variety of downstream tasks. We set the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher over sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage in which individually linearized experts are combined into a single model. We demonstrate the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.
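
The abstract does not give a formula for the tolerance-corrected Win-and-Tie rate. As a rough sketch only, one plausible reading counts a task as a win or tie when the student's score falls no more than a tolerance below the teacher's. This assumes higher-is-better task scores and an absolute tolerance; the function name `win_and_tie_rate`, the `tolerance` parameter, the task names, and the scores below are all hypothetical and are not taken from the paper.

```python
from typing import Mapping


def win_and_tie_rate(
    student: Mapping[str, float],
    teacher: Mapping[str, float],
    tolerance: float = 0.01,
) -> float:
    """Fraction of shared tasks on which the student wins or ties.

    ASSUMPTION (not from the source): scores are higher-is-better, and a task
    counts as a win or tie when student >= teacher - tolerance.
    """
    tasks = sorted(set(student) & set(teacher))
    if not tasks:
        raise ValueError("student and teacher share no tasks")
    wins_or_ties = sum(
        1 for task in tasks if student[task] >= teacher[task] - tolerance
    )
    return wins_or_ties / len(tasks)


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    teacher_scores = {"task_a": 0.70, "task_b": 0.55, "task_c": 0.80}
    student_scores = {"task_a": 0.72, "task_b": 0.545, "task_c": 0.75}
    print(win_and_tie_rate(student_scores, teacher_scores, tolerance=0.01))
```

Under this reading, a rate close to 1.0 would correspond to distillation that is lossless up to the chosen tolerance. A relative tolerance or a per-task tolerance would be an equally reasonable variant.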
