Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning

Abstract

Hybrid LLM architectures that combine attention and state space models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has shown that applying compression and distillation to attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing hybrid architectures. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. Furthermore, we demonstrate that such SSM pruning is necessary to achieve better accuracy and inference speed than traditional approaches. Our compression recipe combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge-distillation-based retraining, similar to the MINITRON technique. Using this approach, we compress the Nemotron-H 8B hybrid model down to 4B parameters with up to 40x fewer training tokens. The resulting model surpasses the accuracy of similarly sized models while achieving 2x faster inference, significantly advancing the Pareto frontier.
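
The abstract does not spell out the pruning algorithm, so the following is only an illustrative sketch of what "group-aware" selection could look like. It assumes that SSM heads are partitioned into contiguous, equally sized groups and that each head already has an importance score; the function name `group_aware_head_selection`, the scoring, and the contiguous layout are all hypothetical and not taken from the paper. The point it illustrates is that keeping the same number of heads in every group leaves the group structure intact, whereas a global top-k could empty some groups and overfill others.

```python
from typing import List, Sequence


def group_aware_head_selection(
    scores: Sequence[float], num_groups: int, keep_per_group: int
) -> List[int]:
    """Pick heads to keep so that every group retains the same number of heads.

    Assumption (hypothetical, for illustration): heads are laid out contiguously,
    so group g owns heads [g * heads_per_group, (g + 1) * heads_per_group).
    """
    num_heads = len(scores)
    if num_groups <= 0 or num_heads % num_groups != 0:
        raise ValueError("number of heads must be divisible by num_groups")
    heads_per_group = num_heads // num_groups
    if not 0 < keep_per_group <= heads_per_group:
        raise ValueError("keep_per_group must be in 1..heads_per_group")

    kept: List[int] = []
    for g in range(num_groups):
        start = g * heads_per_group
        group = range(start, start + heads_per_group)
        # Rank heads inside this group only, highest score first.
        ranked = sorted(group, key=lambda h: scores[h], reverse=True)
        # Keep the best heads, restoring their original order within the group.
        kept.extend(sorted(ranked[:keep_per_group]))
    return kept


# Hypothetical importance scores for 8 heads split into 2 groups of 4.
scores = [0.9, 0.1, 0.8, 0.7, 0.2, 0.3, 0.6, 0.5]
kept = group_aware_head_selection(scores, num_groups=2, keep_per_group=2)
print(kept)  # [0, 2, 6, 7]
```

In this toy example a global top-4 selection would keep heads 0, 2, 3, and 6, leaving three heads in the first group and one in the second; the per-group selection keeps two in each.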
