Adaptive Layer-skipping in Pre-trained LLMs

Abstract

Various layer-skipping methods have been proposed to accelerate token generation in large language models (LLMs). However, they have overlooked a fundamental question: how do computational demands vary across the generation of different tokens? In this work, we introduce FlexiDepth, a method that dynamically adjusts the number of Transformer layers used in text generation. By incorporating a plug-in router and adapter, FlexiDepth enables adaptive layer-skipping in LLMs without modifying their original parameters. Applied to the Llama-3-8B model, FlexiDepth skips 8 of the 32 layers while maintaining 100% of benchmark performance. Experimental results with FlexiDepth demonstrate that computational demands in LLMs vary significantly with token type. Specifically, generating repetitive tokens or fixed phrases requires fewer layers, whereas producing tokens that involve computation or high uncertainty requires more layers. Interestingly, this adaptive allocation pattern aligns with human intuition. To advance research in this area, we have open-sourced FlexiDepth and a dataset documenting FlexiDepth's layer allocation patterns for future exploration.
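
The abstract describes the mechanism only at a high level: a plug-in router and an adapter wrapped around layers whose original parameters stay untouched. The sketch below is one possible reading of that idea, not the FlexiDepth implementation. The class `GatedLayer`, the sigmoid gate, the 0.5 threshold, the scalar `adapter_scale`, and the toy layer are all hypothetical names and choices introduced for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class GatedLayer:
    """Wraps a frozen layer with a hypothetical per-token router and skip-path adapter."""

    def __init__(self, layer_fn, router_weights, router_bias=0.0,
                 adapter_scale=1.0, threshold=0.5):
        self.layer_fn = layer_fn              # frozen layer: called, never modified
        self.router_weights = router_weights  # assumed linear router over the hidden state
        self.router_bias = router_bias
        self.adapter_scale = adapter_scale    # assumed stand-in for a lightweight adapter
        self.threshold = threshold

    def forward(self, hidden):
        # The router scores the current hidden state of a single token.
        score = sum(w * h for w, h in zip(self.router_weights, hidden))
        score += self.router_bias
        if sigmoid(score) > self.threshold:
            # Gate open: run the full (frozen) layer.
            return self.layer_fn(hidden), True
        # Gate closed: bypass the layer and apply only the cheap adapter.
        return [h * self.adapter_scale for h in hidden], False


def run_stack(layers, hidden):
    """Pass one token through all layers and count how many were executed."""
    used = 0
    for layer in layers:
        hidden, executed = layer.forward(hidden)
        used += int(executed)
    return hidden, used


def toy_layer(hidden):
    # Placeholder for a Transformer layer; it only rescales the vector.
    return [1.1 * h for h in hidden]


layers = [GatedLayer(toy_layer, router_weights=[1.0, 1.0]) for _ in range(4)]
_, depth_a = run_stack(layers, [1.0, 0.5])    # router scores stay positive
_, depth_b = run_stack(layers, [-1.0, -0.5])  # router scores stay negative
print(depth_a, depth_b)  # prints "4 0"
```

In this toy setup the two tokens pass through the same stack but use different numbers of layers, which is the per-token behavior the abstract refers to. A real router and adapter would be learned, and the hard threshold used here is only one of several ways such a gate could be realized.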
