Parcae: Scaling Laws For Stable Looped Language Models

Abstract

Traditional fixed-depth architectures improve quality by scaling training FLOPs, typically by increasing parameter count, at the cost of a larger memory footprint or more data. A potential alternative is looped architectures, which instead increase FLOPs by passing activations repeatedly through a block of layers. While promising, existing recipes for training looped architectures are often unstable, suffering from residual explosion and loss spikes. We address these challenges by recasting looping as a nonlinear time-variant dynamical system over the residual stream. Via a linear approximation to this system, we find that instability in existing looped architectures arises from large spectral norms in their injection parameters. To address this, we propose Parcae, a novel stable looped architecture that constrains the spectral norm of the injection parameters via discretization of a negative diagonal parameterization. As a result, Parcae achieves up to 6.3% lower validation perplexity than prior large-scale looped models. Using our stable looped architecture, we investigate the scaling properties of looping as a means of improving quality by increasing FLOPs at training and test time. For training, we derive predictable power laws for scaling FLOPs while keeping parameter count fixed. Our initial scaling laws suggest that, given a fixed FLOP budget, looping and data should be increased in tandem. At test time, we find that Parcae can use looping to scale compute, following a predictable, saturating exponential decay. When scaled up to 1.3B parameters, Parcae improves CORE and Core-Extended quality by 2.99 and 1.18 points, respectively, over strong Transformer baselines under a fixed parameter and data budget, achieving up to 87.5% of the quality of a Transformer twice its size.
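
The stability argument in the abstract can be illustrated with a toy linear loop. The sketch below is not Parcae's actual parameterization, which the abstract does not spell out. It assumes a diagonal matrix applied to the looped state, an exponential (zero-order-hold style) discretization with a positive step size `delta`, and a constant injected input. All function names and numeric values are hypothetical. It shows only the general mechanism: discretizing a negative diagonal yields entries strictly inside (0, 1), so the spectral norm stays below 1 and the looped state remains bounded, whereas an unconstrained diagonal with an entry above 1 grows without bound.

```python
import math


def discretize_negative_diagonal(log_a, delta):
    """Discretize a negative diagonal matrix (assumed form, for illustration).

    The continuous-time diagonal entries are a_i = -exp(log_a_i) < 0, so the
    discrete entries exp(delta * a_i) lie strictly in (0, 1) when delta > 0.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    return [math.exp(-delta * math.exp(x)) for x in log_a]


def spectral_norm_diagonal(diag):
    """The spectral norm of a diagonal matrix is its largest absolute entry."""
    return max(abs(d) for d in diag)


def run_linear_loop(diag, injection, num_loops):
    """Iterate h <- diag * h + injection elementwise; return max |h| per loop."""
    h = [0.0] * len(diag)
    norms = []
    for _ in range(num_loops):
        h = [d * hi + e for d, hi, e in zip(diag, h, injection)]
        norms.append(max(abs(v) for v in h))
    return norms


# Hypothetical values chosen only for illustration.
log_a = [-2.0, 0.0, 1.5]
injection = [1.0, 1.0, 1.0]

stable_diag = discretize_negative_diagonal(log_a, delta=0.5)
unconstrained_diag = [1.2, 0.9, 1.05]  # largest entry exceeds 1

stable_norms = run_linear_loop(stable_diag, injection, num_loops=64)
unstable_norms = run_linear_loop(unconstrained_diag, injection, num_loops=64)

print("stable spectral norm:", spectral_norm_diagonal(stable_diag))
print("stable max |h| after 64 loops:", stable_norms[-1])
print("unconstrained max |h| after 64 loops:", unstable_norms[-1])
```

In this toy setting the bounded case converges toward `injection / (1 - diag)` per coordinate, while the unconstrained case grows geometrically with the number of loops, a simplified analogue of the residual explosion described above.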
