The Unreasonable Ineffectiveness of the Deeper Layers

Abstract

We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs and find minimal degradation of performance on several question-answering benchmarks until a large fraction (up to half) of the layers has been removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), so that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer pruning can complement other PEFT strategies to further reduce the computational cost of finetuning, and can also improve the memory footprint and latency of inference. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
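The block-selection step described in the abstract can be illustrated with a minimal sketch. The abstract says only that the block is chosen "by considering similarity across layers"; the sketch assumes, for illustration, that similarity is measured as the angular distance between the representation entering layer l and the representation n layers later, and that the block with the smallest distance is removed. The function names (`angular_distance`, `best_block_to_prune`, `prune`) and the toy data are hypothetical, and the healing step (QLoRA finetuning) is not shown.

```python
import numpy as np


def angular_distance(a, b):
    """Mean angular distance, scaled to [0, 1], between matching rows of a and b."""
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    )
    # Clip to guard against round-off pushing the cosine outside [-1, 1].
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi))


def best_block_to_prune(hidden_states, n):
    """Pick the n-layer block whose input and output representations are most similar.

    hidden_states: list of L + 1 arrays of shape (num_samples, dim), where
    hidden_states[l] is the input to layer l and hidden_states[L] is the
    output of the last layer. (Assumed layout, for illustration only.)
    Returns (start, distance); the block is layers start .. start + n - 1.
    """
    num_layers = len(hidden_states) - 1
    if not 0 < n < num_layers:
        raise ValueError("n must satisfy 0 < n < number of layers")
    distances = [
        angular_distance(hidden_states[l], hidden_states[l + n])
        for l in range(num_layers - n + 1)
    ]
    start = int(np.argmin(distances))
    return start, distances[start]


def prune(layers, start, n):
    """Return the layer list with layers start .. start + n - 1 removed."""
    return layers[:start] + layers[start + n:]


# Toy data: 6 "layers", where layers 2 and 3 leave the representation unchanged.
rng = np.random.default_rng(0)
toy_states = [rng.normal(size=(4, 8)) for _ in range(7)]
toy_states[3] = toy_states[2].copy()
toy_states[4] = toy_states[2].copy()

toy_start, toy_distance = best_block_to_prune(toy_states, n=2)
toy_pruned = prune(list(range(6)), toy_start, 2)
```

In this toy setup the selected block is the pair of layers that act as the identity, so removing them leaves the downstream input unchanged. For a real model, the hidden states would come from forward passes over a sample of text, and a small amount of finetuning would follow the removal.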
