Paloma: A Benchmark for Evaluating Language Model Fit
December 16, 2023
作者: Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah A. Smith, Kyle Richardson, Jesse Dodge
cs.AI
Abstract
Language models (LMs) commonly report perplexity on monolithic data held out
from training. Implicitly or explicitly, this data is composed of
domains: varying distributions of language. Rather than assuming
perplexity on one distribution extrapolates to others, Perplexity Analysis for
Language Model Assessment (Paloma) measures LM fit to 585 text domains,
ranging from nytimes.com to r/depression on Reddit. We invite submissions to
our benchmark and organize results by comparability based on compliance with
guidelines such as removal of benchmark contamination from pretraining.
Submissions can also record parameter and training token counts, enabling
comparisons of Pareto efficiency: performance as a function of these measures
of cost. We populate our benchmark with results from 6 baselines
pretrained on popular corpora. In case studies, we demonstrate analyses that
are possible with Paloma, such as finding that pretraining without data beyond
Common Crawl leads to inconsistent fit to many domains.
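As a rough illustration of the kind of per-domain perplexity measurement the abstract describes, the sketch below groups texts by domain and reports perplexity for each. This is not Paloma's actual evaluation pipeline; the model name, the `domain_texts` mapping, and the sample strings are placeholders chosen only to make the example self-contained.

```python
# Minimal sketch of per-domain perplexity, assuming a Hugging Face causal LM.
# Placeholders: model_name and domain_texts are illustrative, not Paloma's setup.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def domain_perplexity(texts):
    """Perplexity over all tokens in one domain: exp(total NLL / total tokens)."""
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
        input_ids = enc["input_ids"]
        with torch.no_grad():
            out = model(input_ids, labels=input_ids)
        n_predictions = input_ids.size(1) - 1  # next-token predictions made
        total_nll += out.loss.item() * n_predictions  # loss is mean NLL per token
        total_tokens += n_predictions
    return math.exp(total_nll / total_tokens)

# Hypothetical domain grouping mirroring the examples named in the abstract.
domain_texts = {
    "nytimes.com": ["Example news article text ..."],
    "r/depression": ["Example Reddit post text ..."],
}
for domain, texts in domain_texts.items():
    print(f"{domain}: perplexity = {domain_perplexity(texts):.2f}")
```

Reporting one perplexity per domain, rather than a single number over a monolithic held-out set, is what allows the kind of comparison the abstract highlights, e.g. whether a model fits news text and Reddit text equally well.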