Predicting Downstream Performance of LLMs Using Proxy Metrics

Abstract

Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next-token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) for cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman ρ = 0.81 (vs. ρ = 0.36 for cross-entropy loss); 2) for pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly 10,000× less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an 18× compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.
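
To make the token-level statistics named above concrete, the following is a minimal sketch of how entropy, top-k accuracy, and expert token rank could be computed from a model's next-token logits on one expert-written solution, and how a resulting proxy could be compared against downstream accuracy with Spearman's rank correlation. It is an illustration under stated assumptions, not the paper's implementation: the function names (`token_stats`, `aggregate`, `spearman_rho`), the choice of mean and mean-log-rank as aggregations, and the default `k=5` are all hypothetical, and the rank correlation helper assumes no tied values.

```python
import numpy as np


def token_stats(logits, expert_ids, k=5):
    """Per-token statistics of a next-token distribution on an expert solution.

    logits:     array of shape (T, V), model outputs at each of T positions
                (hypothetical input; any model that exposes logits would do).
    expert_ids: array of shape (T,), the expert-written token id per position.
    k:          cutoff for the top-k hit indicator (assumed default).
    """
    logits = np.asarray(logits, dtype=np.float64)
    expert_ids = np.asarray(expert_ids)
    positions = np.arange(len(expert_ids))

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)

    # Shannon entropy (in nats) of each next-token distribution.
    entropy = -(probs * log_probs).sum(axis=-1)

    # Rank of the expert token: 1 means it is the model's most likely token.
    expert_logit = logits[positions, expert_ids]
    rank = 1 + (logits > expert_logit[:, None]).sum(axis=-1)

    return {
        "entropy": entropy,
        "rank": rank,
        "top_k_hit": rank <= k,
        "nll": -log_probs[positions, expert_ids],
    }


def aggregate(stats):
    """Collapse per-token statistics into scalar proxies (assumed aggregations)."""
    return {
        "mean_entropy": float(stats["entropy"].mean()),
        "top_k_accuracy": float(stats["top_k_hit"].mean()),
        "mean_log_rank": float(np.log(stats["rank"]).mean()),
        "cross_entropy": float(stats["nll"].mean()),
    }


def spearman_rho(x, y):
    """Spearman rank correlation; this simple version assumes no ties."""
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_logits = rng.normal(size=(8, 50))       # 8 positions, vocabulary of 50
    toy_expert = rng.integers(0, 50, size=8)    # stand-in expert tokens
    print(aggregate(token_stats(toy_logits, toy_expert, k=5)))
```

In practice one would compute such proxies for each candidate model (or each candidate corpus's trained model) over a set of expert solutions, then check how well the proxy ordering agrees with the ordering by measured downstream accuracy, for example with `spearman_rho(proxy_scores, downstream_accuracies)`.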