Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

Abstract

Rapid advances in computing have dramatically increased the scale and cost of training Large Language Models (LLMs). Accurately predicting downstream task performance before model training is crucial for efficient resource allocation, yet it remains challenging for two main reasons: (1) the "emergence phenomenon", in which downstream performance metrics become meaningful only after extensive training, limiting the use of smaller models for prediction; and (2) uneven task difficulty distributions and the absence of consistent scaling laws, which result in substantial metric variability. Existing performance prediction methods suffer from limited accuracy and reliability, which impedes the assessment of potential LLM capabilities. To address these challenges, we propose a Clustering-On-Difficulty (COD) downstream performance prediction framework. COD first constructs a predictable support subset by clustering tasks on difficulty features and excluding non-emergent and non-scalable clusters. Scores on the selected subset serve as effective intermediate predictors of downstream performance on the full evaluation set. With theoretical support, we derive a mapping function that transforms performance metrics from the predictable subset to the full evaluation set, thereby ensuring accurate extrapolation of LLM downstream performance. The proposed method has been applied to predict performance scaling for a 70B LLM, providing actionable insights for training resource allocation and assisting in monitoring the training process. Notably, COD achieves remarkable predictive accuracy on the 70B LLM by leveraging an ensemble of small models, with an absolute mean deviation of 1.36% across eight important LLM evaluation benchmarks.
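The pipeline described in the abstract (cluster tasks by difficulty, drop non-emergent and non-scalable clusters, then map subset scores to full-set scores) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the difficulty feature (a task's pass rates across small models ordered by size), the greedy "leader" clustering, the thresholds `radius`, `min_signal`, and `tol`, and the linear mapping are hypothetical stand-ins for the authors' actual clustering algorithm, filtering criteria, and derived mapping function. The extrapolation of subset scores to the target model scale is omitted.

```python
from statistics import mean

def leader_cluster(features, radius):
    """Greedy 'leader' clustering (illustrative stand-in): a task joins the
    first cluster whose leader is within `radius` (max abs difference),
    otherwise it starts a new cluster."""
    clusters = []  # each entry is a list of task indices; entry[0] is the leader
    for i, f in enumerate(features):
        for members in clusters:
            leader = features[members[0]]
            if max(abs(a - b) for a, b in zip(f, leader)) <= radius:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def cluster_curve(features, members):
    """Mean pass rate of a cluster at each model size."""
    n_models = len(features[0])
    return [mean(features[i][m] for i in members) for m in range(n_models)]

def is_predictable(curve, min_signal=0.05, tol=0.01):
    """Assumed filter: keep a cluster only if its curve rises with scale
    (scalable) and shows measurable signal at the largest small model."""
    rising = all(b >= a - tol for a, b in zip(curve, curve[1:]))
    return rising and curve[-1] >= min_signal and curve[-1] > curve[0]

def select_subset(features, radius=0.15):
    """Indices of tasks that belong to clusters passing the filter."""
    kept = []
    for members in leader_cluster(features, radius):
        if is_predictable(cluster_curve(features, members)):
            kept.extend(members)
    return sorted(kept)

def fit_linear_map(subset_scores, full_scores):
    """Least-squares line full = a * subset + b (hypothetical mapping;
    the paper derives its own mapping function)."""
    mx, my = mean(subset_scores), mean(full_scores)
    sxx = sum((x - mx) ** 2 for x in subset_scores)
    a = sum((x - mx) * (y - my)
            for x, y in zip(subset_scores, full_scores)) / sxx
    return a, my - a * mx

# Toy data: rows are tasks, columns are small models of increasing size.
features = [
    [0.20, 0.40, 0.60],  # improves with scale
    [0.25, 0.45, 0.65],  # improves with scale
    [0.00, 0.00, 0.00],  # no signal yet (non-emergent)
    [0.00, 0.01, 0.00],  # no signal yet (non-emergent)
    [0.90, 0.50, 0.70],  # non-monotonic (non-scalable)
    [0.50, 0.70, 0.90],  # improves with scale
]
subset = select_subset(features)  # [0, 1, 5]
subset_scores = [mean(features[i][m] for i in subset) for m in range(3)]
full_scores = [mean(row[m] for row in features) for m in range(3)]
a, b = fit_linear_map(subset_scores, full_scores)
```

In this toy setup the two all-zero tasks and the non-monotonic task are excluded, so the subset score follows a smooth trend that a fitted map can relate back to the full evaluation set.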
