Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective
February 24, 2025
Authors: Chengyin Xu, Kaiyuan Chen, Xiao Li, Ke Shen, Chenggang Li
cs.AI
Abstract
Rapid advances in computing have dramatically increased the scale and cost
of training Large Language Models (LLMs). Accurately predicting downstream task
performance prior to model training is crucial for efficient resource
allocation, yet remains challenging due to two primary constraints: (1) the
"emergence phenomenon", wherein downstream performance metrics become
meaningful only after extensive training, which limits the ability to use
smaller models for prediction; and (2) uneven task difficulty distributions and the
absence of consistent scaling laws, resulting in substantial metric
variability. Existing performance prediction methods suffer from limited
accuracy and reliability, thereby impeding the assessment of potential LLM
capabilities. To address these challenges, we propose a
Clustering-On-Difficulty (COD) downstream performance prediction framework. COD
first constructs a predictable support subset by clustering tasks based on
difficulty features, strategically excluding non-emergent and non-scalable
clusters. The scores on the selected subset serve as effective intermediate
predictors of downstream performance on the full evaluation set. With
theoretical support, we derive a mapping function that transforms performance
metrics from the predictable subset to the full evaluation set, thereby
ensuring accurate extrapolation of LLM downstream performance. The proposed
method has been applied to predict performance scaling for a 70B LLM, providing
actionable insights for training resource allocation and assisting in
monitoring the training process. Notably, COD achieves remarkable predictive
accuracy on the 70B LLM by leveraging an ensemble of small models,
demonstrating an absolute mean deviation of 1.36% across eight important LLM
evaluation benchmarks.
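To make the described pipeline concrete, below is a minimal sketch of the four stages the abstract outlines: clustering tasks on difficulty features, filtering to a predictable support subset, extrapolating the subset score, and mapping it to the full evaluation set. The abstract does not specify the features, filters, or the derived mapping function, so every concrete choice here (the synthetic pass-rate data, KMeans clustering, monotonicity thresholds, the saturating fit, and the linear subset-to-full mapping) is an assumption for illustration only.

```python
# Illustrative COD-style pipeline sketch; all concrete choices are assumptions.
import numpy as np
from scipy.optimize import curve_fit
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical input: per-task pass rates (rows) across a ladder of small
# models ordered by training compute (columns); synthetic placeholder data.
base = rng.uniform(0.0, 0.7, size=(500, 1))           # per-task base difficulty
trend = np.linspace(0.0, 0.3, 6)                      # gain with compute
pass_rates = np.clip(base + trend + rng.normal(0.0, 0.05, (500, 6)), 0.0, 1.0)
log_compute = np.log10([1e20, 3e20, 1e21, 3e21, 1e22, 3e22])  # assumed FLOPs

# 1) Cluster tasks on difficulty features (here, simply each task's
#    pass-rate trajectory across the small-model ladder).
k = 8  # assumed number of clusters
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pass_rates)

# 2) Build the predictable support subset: keep clusters whose mean pass
#    rate grows roughly monotonically with compute, as a stand-in for
#    excluding non-emergent and non-scalable clusters.
predictable = []
for c in range(k):
    traj = pass_rates[labels == c].mean(axis=0)
    gains = np.diff(traj)
    if gains.mean() > 0.01 and (gains >= -0.02).all():  # assumed thresholds
        predictable.append(c)
mask = np.isin(labels, predictable)

# 3) Extrapolate the subset score to the target scale with a saturating
#    curve in log-compute (one plausible functional form, not the paper's).
def scaling_curve(x, a, b, c):
    return c - a * np.exp(-b * x)

x = log_compute - log_compute.min()  # normalize for a stable fit
subset_score = pass_rates[mask].mean(axis=0)
params, _ = curve_fit(scaling_curve, x, subset_score, p0=[0.3, 1.0, 0.9],
                      maxfev=10000)
x_target = np.log10(1e24) - log_compute.min()  # hypothetical target compute
subset_pred = scaling_curve(x_target, *params)

# 4) Map the subset prediction to the full evaluation set. The paper
#    derives a principled mapping function; a linear fit on the observed
#    (subset, full-set) score pairs stands in for it here.
full_score = pass_rates.mean(axis=0)
slope, intercept = np.polyfit(subset_score, full_score, deg=1)
print(f"predicted full-set score at target compute: "
      f"{slope * subset_pred + intercept:.3f}")
```

In practice the cluster filters, scaling fit, and subset-to-full mapping would be fit on actual small-model evaluation results rather than synthetic trajectories, and the paper's derived mapping replaces the placeholder linear fit.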