DLP: Dynamic Layerwise Pruning in Large Language Models

Abstract

Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to severe performance degradation at high sparsity levels. Recognizing the varying contributions of different layers in LLMs, recent studies have shifted their focus toward non-uniform layerwise pruning. However, these approaches often rely on pre-defined values, which can result in suboptimal performance. To overcome these limitations, we propose a novel method called Dynamic Layerwise Pruning (DLP). This approach adaptively determines the relative importance of each layer by integrating model weights with input activation information, assigning pruning rates accordingly. Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). We release the code at https://github.com/ironartisan/DLP to facilitate future research.
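
To make the allocation idea concrete, here is a minimal sketch of how a non-uniform, importance-driven sparsity allocation could look. It is not the authors' implementation. The abstract says only that weights and input activations are combined to rank layers, so everything specific below is an assumption made for illustration: the scoring rule (sum of |W| times the per-input activation norm), the min-max rescaling, the `spread` parameter, and the function names `layer_importance` and `allocate_sparsity`.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def layer_importance(weights: Matrix, activation_norms: Sequence[float]) -> float:
    """Score one layer as the sum of |W[i][j]| * activation_norms[j].

    Assumed metric for illustration; the paper's exact score may differ.
    """
    return sum(
        abs(w) * activation_norms[j]
        for row in weights
        for j, w in enumerate(row)
    )


def allocate_sparsity(
    importances: Sequence[float], target: float, spread: float
) -> List[float]:
    """Assign higher sparsity to less important layers, keeping the mean at `target`."""
    lo, hi = min(importances), max(importances)
    if hi == lo:
        # All layers look equally important: fall back to uniform pruning.
        return [target] * len(importances)
    # Rescale to [0, 1], where 1 marks the least important layer.
    unimportance = [(hi - s) / (hi - lo) for s in importances]
    mean_u = sum(unimportance) / len(unimportance)
    # Centering on the mean keeps the average per-layer rate equal to `target`.
    rates = [target + spread * (u - mean_u) for u in unimportance]
    if not all(0.0 <= r <= 1.0 for r in rates):
        raise ValueError("spread is too large for this target sparsity")
    return rates


# Toy example: three layers, each given as (weight matrix, activation norms).
layers = [
    ([[0.5, -1.0], [2.0, 0.1]], [1.0, 2.0]),
    ([[0.1, 0.2], [0.0, -0.1]], [1.0, 1.0]),
    ([[1.0, 1.0], [1.0, 1.0]], [0.5, 0.5]),
]
scores = [layer_importance(w, a) for w, a in layers]
rates = allocate_sparsity(scores, target=0.7, spread=0.1)
```

In this toy setup the second layer has the lowest score and so receives the highest pruning rate, while the mean rate stays at the 70% target. That mean matches the model-wide sparsity only when the layers have equal parameter counts, which holds for this example but is an added assumption in general.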
