Shortened LLaMA: A Simple Depth Pruning for Large Language Models

Abstract

Structured pruning of modern large language models (LLMs) has emerged as a way to reduce their high computational demands. Width pruning reduces the size of projection weight matrices (e.g., by removing attention heads) while keeping the number of layers fixed. Depth pruning, in contrast, removes entire layers or blocks while leaving the size of the remaining weights unchanged. Most current research focuses on width-only pruning or on a blend of width and depth pruning, with little comparative analysis of how the two pruning units (width vs. depth) affect LLM inference efficiency. In this work, we show that a simple depth pruning approach can compete with recent width pruning methods in zero-shot task performance. Our pruning method improves inference speed, especially under memory-constrained conditions that require limited batch sizes for running LLMs, where width pruning is ineffective. We hope this work can help deploy LLMs on local and edge devices.
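
The distinction between the two pruning units can be illustrated with a toy parameter count. The sketch below is not the authors' method: the `Block` class, the `width_prune` and `depth_prune` helpers, the LLaMA-7B-like dimensions, and the dropped block indices are all hypothetical choices made for illustration, and biases, normalization weights, and embeddings are ignored. It only shows that width pruning keeps every block but shrinks its projection matrices, whereas depth pruning removes whole blocks and leaves the survivors untouched.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List


@dataclass(frozen=True)
class Block:
    """Toy Transformer block described only by its shape (no real weights)."""
    d_model: int
    n_heads: int
    head_dim: int
    d_ffn: int

    def num_params(self) -> int:
        # Assumption: four attention projections (Q, K, V, O) and a
        # three-matrix gated feed-forward network; biases and norms ignored.
        attn = 4 * self.d_model * self.n_heads * self.head_dim
        ffn = 3 * self.d_model * self.d_ffn
        return attn + ffn


def width_prune(blocks: List[Block], heads_to_keep: int) -> List[Block]:
    """Keep every block but shrink its attention projections."""
    return [replace(b, n_heads=heads_to_keep) for b in blocks]


def depth_prune(blocks: List[Block], drop: Iterable[int]) -> List[Block]:
    """Remove whole blocks by index; surviving blocks are unchanged."""
    drop_set = set(drop)
    return [b for i, b in enumerate(blocks) if i not in drop_set]


def total_params(blocks: List[Block]) -> int:
    return sum(b.num_params() for b in blocks)


# Hypothetical base model with LLaMA-7B-like block shapes (an assumption).
base = [Block(d_model=4096, n_heads=32, head_dim=128, d_ffn=11008) for _ in range(32)]

# Width pruning: same depth, fewer heads per block.
width = width_prune(base, heads_to_keep=24)

# Depth pruning: arbitrary example indices, not an importance-based selection.
depth = depth_prune(base, drop=range(20, 26))

for name, model in [("base", base), ("width", width), ("depth", depth)]:
    print(f"{name}: {len(model)} blocks, {total_params(model):,} block parameters")
```

In this toy setting the width-pruned model still executes 32 sequential blocks, while the depth-pruned model executes 26, which is the structural difference the abstract points to when it discusses inference speed under limited batch sizes.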
