Parallel Scaling Law for Language Models

Abstract

It is commonly believed that scaling language models incurs a significant space or time cost, by increasing either the parameters (parameter scaling) or the output tokens (inference-time scaling). We introduce a third, more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference. We apply P diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the P outputs. This method, parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with P parallel streams is similar to scaling the parameters by O(log P) while showing superior inference efficiency. For example, ParScale can incur up to a 22× smaller memory increase and a 6× smaller latency increase than parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small number of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios and provides an alternative perspective on the role of computation in machine learning.
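The three steps named in the abstract (transform the input P ways, run P forward passes over shared parameters, aggregate dynamically) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the "model" is a single tanh layer, each learnable transformation is a hypothetical additive shift vector, and the aggregator is a softmax over scores from a hypothetical `gate` vector. The streams run sequentially here; a real implementation would presumably batch them.

```python
import math
import random


def shared_model(x, weights):
    """Toy stand-in for the language model: one linear layer plus tanh.
    The same `weights` are reused by every parallel stream."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def parscale_forward(x, weights, shifts, gate):
    """Run P streams over shared weights and combine their outputs.

    shifts: P vectors, one per stream, added to the input
            (assumed form of the learnable transformation).
    gate:   vector used to score each stream's output
            (assumed form of the dynamic aggregator).
    """
    # Step 1: apply P different transformations to the same input.
    inputs = [[xi + si for xi, si in zip(x, shift)] for shift in shifts]
    # Step 2: one forward pass per stream, all sharing `weights`.
    outputs = [shared_model(inp, weights) for inp in inputs]
    # Step 3: aggregation weights depend on the outputs themselves.
    scores = [sum(g * o for g, o in zip(gate, out)) for out in outputs]
    mix = softmax(scores)
    dim = len(outputs[0])
    combined = [
        sum(mix[p] * outputs[p][d] for p in range(len(outputs)))
        for d in range(dim)
    ]
    return combined, mix


# Usage: P = 8 streams over a 4-input, 3-output toy model.
rng = random.Random(0)
d_in, d_out, P = 4, 3, 8
weights = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
shifts = [[rng.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(P)]
gate = [rng.uniform(-1, 1) for _ in range(d_out)]
x = [0.2, -0.1, 0.4, 0.7]
y, mix = parscale_forward(x, weights, shifts, gate)
```

In this sketch, raising P adds only the per-stream shift vectors, while `weights` stays the same size. That mirrors the abstract's point that parallel computation is scaled by reusing existing parameters rather than by growing the model.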
