SkyLadder: Better and Faster Pretraining via Context Window Scheduling

Abstract

Recent advancements in LLM pretraining have featured ever-expanding context windows to process longer sequences. However, our pilot study reveals that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget. This finding motivates us to explore an optimal context window scheduling strategy to better balance long-context capability with pretraining efficiency. To this end, we propose SkyLadder, a simple yet effective approach that implements a short-to-long context window transition. SkyLadder preserves strong standard benchmark performance while matching or exceeding baseline results on long-context tasks. Through extensive experiments, we pretrain 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks while achieving up to 22% faster training speeds compared to baselines. The code is at https://github.com/sail-sg/SkyLadder.
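The abstract does not state the shape of the short-to-long schedule. As a minimal sketch only, assuming a linear ramp from a small starting window to the final window, the idea could look like the following. The function name `context_window`, the parameter names, and the default values (`start=32`, `end=8192`, `rate=0.125`) are hypothetical choices for illustration and are not taken from the text above.

```python
def context_window(step: int, start: int = 32, end: int = 8192, rate: float = 0.125) -> int:
    """Hypothetical linear short-to-long schedule.

    The window starts at `start` tokens, grows by `rate` tokens per
    training step, and is capped at the final window size `end`.
    All defaults are illustrative assumptions.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    return min(end, start + int(rate * step))


if __name__ == "__main__":
    # Print the window size at a few training steps.
    for step in (0, 1_000, 32_000, 65_280, 100_000):
        print(step, context_window(step))
```

Under these assumed defaults the window reaches its 8K cap at step 65,280 and stays there for the rest of training, so early steps process short sequences and only later steps use the full context.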
