Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?
June 6, 2024
Authors: Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo
cs.AI
Abstract
Predictable behavior from scaling advanced AI systems is an extremely
desirable property. Although a well-established literature exists on how
pretraining performance scales, the literature on how particular downstream
capabilities scale is significantly muddier. In this work, we take a step back
and ask: why has predicting specific downstream capabilities with scale
remained elusive? While many factors are certainly responsible, we identify a
new factor that makes modeling scaling behavior on widely used multiple-choice
question-answering benchmarks challenging. Using five model families and twelve
well-established multiple-choice benchmarks, we show that downstream
performance is computed from negative log likelihoods via a sequence of
transformations that progressively degrade the statistical relationship between
performance and scale. We then reveal the mechanism causing this degradation:
downstream metrics require comparing the correct choice against a small number
of specific incorrect choices, meaning accurately predicting downstream
capabilities requires predicting not just how probability mass concentrates on
the correct choice with scale, but also how probability mass fluctuates on
specific incorrect choices with scale. We empirically study how probability
mass on the correct choice co-varies with probability mass on incorrect choices
with increasing compute, suggesting that scaling laws for incorrect choices
might be achievable. Our work also explains why pretraining scaling laws are
commonly regarded as more predictable than downstream capabilities and
contributes towards establishing scaling-predictable evaluations of frontier AI
models.
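To make the mechanism concrete, here is a minimal Python sketch of the kind of transformation chain the abstract describes: downstream multiple-choice accuracy computed from per-choice negative log likelihoods. The function name `mcq_accuracy` and the toy numbers are illustrative assumptions, not code or data from the paper.

```python
import numpy as np

def mcq_accuracy(nlls_per_item: list[np.ndarray], answer_idxs: list[int]) -> float:
    """Downstream accuracy from per-choice negative log likelihoods.

    Each transformation discards information, progressively degrading
    the statistical relationship between the metric and scale: the
    final 0/1 outcome depends not only on the NLL of the correct
    choice but on how it compares against the specific incorrect ones.
    """
    n_correct = 0
    for nlls, answer_idx in zip(nlls_per_item, answer_idxs):
        # Step 1: NLL -> (unnormalized) probability mass per choice.
        masses = np.exp(-nlls)
        # Step 2: renormalize over only this question's few choices.
        probs = masses / masses.sum()
        # Step 3: collapse to a discrete comparison: the model scores
        # only if the correct choice outweighs every incorrect one.
        n_correct += int(np.argmax(probs) == answer_idx)
    return n_correct / len(answer_idxs)

# Toy numbers (not real model outputs): two 4-way questions,
# correct choice at index 0 in both.
nlls = [np.array([1.2, 0.9, 2.5, 3.0]),
        np.array([0.4, 1.8, 2.2, 2.9])]
print(mcq_accuracy(nlls, [0, 0]))  # 0.5: first item lost to choice 1
```

The first toy item illustrates the failure mode: the correct-choice NLL can improve smoothly with scale, yet a fluctuation of probability mass onto one specific incorrect choice still flips the binary outcome, which is why predicting accuracy requires modeling the incorrect choices as well.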