Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

Abstract

Rapid advances in computing have dramatically increased the scale and cost of training Large Language Models (LLMs). Accurately predicting downstream task performance before training is crucial for efficient resource allocation, yet it remains challenging for two main reasons: (1) the "emergence phenomenon", in which downstream performance metrics become meaningful only after extensive training, limiting the use of smaller models for prediction; and (2) uneven task difficulty distributions and the absence of consistent scaling laws, which cause substantial metric variability. Existing performance prediction methods offer limited accuracy and reliability, which impedes the assessment of potential LLM capabilities. To address these challenges, we propose Clustering-On-Difficulty (COD), a downstream performance prediction framework. COD first constructs a predictable support subset by clustering tasks on difficulty features and excluding non-emergent and non-scalable clusters. Scores on the selected subset serve as intermediate predictors of downstream performance on the full evaluation set. With theoretical support, we derive a mapping function that transforms performance metrics from the predictable subset to the full evaluation set, supporting accurate extrapolation of LLM downstream performance. We applied the method to predict performance scaling for a 70B LLM, which provided actionable insights for training resource allocation and helped monitor the training process. Notably, by leveraging an ensemble of small models, COD achieves a mean absolute deviation of 1.36% on the 70B LLM across eight important LLM evaluation benchmarks.
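The subset-construction step described in the abstract can be pictured with a small sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: each task's "difficulty feature" is taken to be its pass rate on a few small models ordered by training compute, a tiny k-means stands in for whatever clustering algorithm COD actually uses, and the thresholds `emerge_min` and `gain_min` as well as the function names are hypothetical. The extrapolation and subset-to-full mapping steps are omitted because the abstract gives no functional form for them.

```python
# Illustrative sketch only: the feature definition, k-means, thresholds, and
# all names below are assumptions, not the paper's implementation.
from statistics import mean


def kmeans(points, k, iters=20):
    """Tiny deterministic k-means over equal-length tuples of floats."""
    # Seed centroids with points spaced evenly along the mean-pass-rate ordering.
    ordered = sorted(points, key=mean)
    step = max(1, len(ordered) // k)
    centroids = [ordered[min(i * step, len(ordered) - 1)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Recompute centroids; an empty cluster keeps its previous centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(mean(col) for col in zip(*members))
    return labels, centroids


def select_predictable(centroids, emerge_min=0.05, gain_min=0.02):
    """Return indices of clusters treated as both emergent and scalable."""
    keep = []
    for idx, curve in enumerate(centroids):
        # Emergent (assumed test): the largest small model scores above a floor.
        emergent = curve[-1] >= emerge_min
        # Scalable (assumed test): the curve never decreases and gains enough overall.
        monotone = all(b >= a - 1e-9 for a, b in zip(curve, curve[1:]))
        scalable = monotone and (curve[-1] - curve[0]) >= gain_min
        if emergent and scalable:
            keep.append(idx)
    return keep


# Pass rates of six hypothetical tasks on three small models (small -> large).
tasks = [
    (0.20, 0.40, 0.60), (0.25, 0.45, 0.65),   # improve steadily with scale
    (0.00, 0.00, 0.01), (0.00, 0.01, 0.00),   # not yet emerged
    (0.50, 0.50, 0.50), (0.52, 0.48, 0.50),   # flat, no scaling trend
]

labels, centroids = kmeans(tasks, k=3)
keep = select_predictable(centroids)
# Indices of tasks that fall in a kept cluster form the support subset.
support = [i for i, lab in enumerate(labels) if lab in keep]
print(support)
```

On this toy data only the two steadily improving tasks remain in the support subset; the near-zero and flat clusters are dropped, mirroring the exclusion of non-emergent and non-scalable clusters in spirit only.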
