점감형 언어 모델

초록

최신 언어 모델—트랜스포머, 순환 신경망, 메모리 기반 변형을 포함하여—은 깊이에 걸쳐 파라미터가 균일하게 할당된 동일한 층의 스택이라는 공통된 기본 구조를 공유한다. 이는 원래 트랜스포머로부터 계승된 기본값으로, 이후 거의 변경되지 않았으나, 점점 더 많은 증거는 층이 최종 출력에 비균일하게 기여하며, 후반 층이 잔차 스트림을 변환하기보다 정제한다는 것을 시사한다. 본 연구는 파라미터 용량이 이러한 비대칭성을 반영해야 하는지 묻는다. 통제된 실험 결과, 고정된 예산 하에서 초기 층에 더 많은 용량을 할당하고 후기 층에 더 적게 할당하는 것이 균일한 너비 기준선에 비해 혼란도를 개선하는 반면, 반대 할당은 성능을 저하시킨다. 이 결과를 바탕으로, 본 연구는 테이퍼드 언어 모델(Tapered Language Models, TLMs)을 도입한다. 이는 고정된 총 예산 하에서 파라미터를 포함하는 구성 요소가 깊이에 따라 단조롭게 점감되는 아키텍처 원리이다. MLP는 이러한 구현에 자연스러운 대상이다. MLP는 모든 최신 언어 모델 계열에서 파라미터 수를 지배하며, 너비를 단일하고 깔끔한 변동 축으로 제공한다. 세 가지 모델 규모와 네 가지 아키텍처(트랜스포머, 게이티드 어텐션, 호프 어텐션, 타이탄스)에 걸쳐, 부드러운 코사인 스케줄을 통해 MLP 너비를 점감하면 추가적인 파라미터나 계산 비용 없이 균일한 기준선에 비해 일관되게 혼란도와 하류 벤치마크 성능이 개선된다. 이러한 발견은 깊이를 고려한 용량 할당이 아키텍처에 구애받지 않는 단순한 언어 모델 설계 축임을 입증하며, 이는 눈에 띄지 않는 자유 레버가 숨겨져 있음을 보여준다.

English

Modern language models, including transformer, recurrent, and memory-based variants, share a common chassis: a stack of identical layers in which parameters are allocated uniformly across depth. This is a default inherited from the original transformer and largely unchanged since, yet a growing body of evidence suggests that layers contribute non-uniformly to the final output, with later layers refining the residual stream rather than transforming it. We ask whether parameter capacity should reflect this asymmetry. Our controlled experiment shows that, under a fixed budget, allocating more capacity to earlier layers and less to later layers improves perplexity over a uniform-width baseline, while the reverse allocation hurts. Building on this result, we introduce Tapered Language Models (TLMs), an architectural principle in which a parameter-bearing component is monotonically tapered across depth under a fixed total budget. MLPs are the natural site for this instantiation: they dominate parameter count across all modern LM families and expose width as a single, clean axis of variation. Across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, and Titans), tapering MLP width via a smooth cosine schedule consistently improves perplexity and downstream benchmark performance over uniform baselines, at no additional parameter or compute cost. These findings establish depth-aware capacity allocation as a simple, architecture-agnostic axis of language model design, a free lever hidden in plain sight.