Implementing Feature Learning in Large-Scale Gated Delta Networks

Abstract

Training and scaling large language models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization (μP) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive scaling rules for Gated Delta Networks. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas the standard parametrization fails to transfer. These results validate the correctness and practical utility of our analysis.