Paloma: A Benchmark for Evaluating Language Model Fit
December 16, 2023
Authors: Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah A. Smith, Kyle Richardson, Jesse Dodge
cs.AI
Abstract
Language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains: varying distributions of language. Rather than assuming perplexity on one distribution extrapolates to others, Perplexity Analysis for Language Model Assessment (Paloma) measures LM fit to 585 text domains, ranging from nytimes.com to r/depression on Reddit. We invite submissions to our benchmark and organize results by comparability based on compliance with guidelines such as removal of benchmark contamination from pretraining. Submissions can also record parameter and training token count to make comparisons of Pareto efficiency for performance as a function of these measures of cost. We populate our benchmark with results from 6 baselines pretrained on popular corpora. In case studies, we demonstrate analyses that are possible with Paloma, such as finding that pretraining without data beyond Common Crawl leads to inconsistent fit to many domains.
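As a rough illustration of the measurement the abstract describes, the sketch below computes perplexity separately for each text domain using an off-the-shelf causal LM. This is not the Paloma evaluation code: the checkpoint name ("gpt2") and the domain_texts examples are placeholders, and Paloma itself covers 585 domains with additional controls such as decontamination guidelines.

```python
# Minimal sketch of per-domain perplexity (fit) measurement.
# Assumes the `transformers` and `torch` packages are installed.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM checkpoint could be used
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Hypothetical evaluation texts keyed by domain; Paloma spans 585 such domains.
domain_texts = {
    "nytimes.com": ["Example news article text for illustration purposes."],
    "r/depression": ["Example Reddit post text for illustration purposes."],
}

def domain_perplexity(texts):
    """Perplexity over all tokens in a domain: exp(total NLL / total tokens)."""
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            # With labels=input_ids, the model returns the mean cross-entropy
            # over the predicted tokens; undo the mean to get a summed NLL.
            loss = model(ids, labels=ids).loss
        n_predicted = ids.shape[1] - 1
        total_nll += loss.item() * n_predicted
        total_tokens += n_predicted
    return math.exp(total_nll / total_tokens)

for domain, texts in domain_texts.items():
    print(f"{domain}: perplexity = {domain_perplexity(texts):.2f}")
```

Reporting perplexity per domain in this way, rather than over one monolithic held-out set, is what lets the benchmark surface cases where a model fits some distributions of language much better than others.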