XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
May 19, 2023
作者: Sebastian Ruder, Jonathan H. Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-Michel A. Sarr, Xinyi Wang, John Wieting, Nitish Gupta, Anna Katanova, Christo Kirov, Dana L. Dickinson, Brian Roark, Bidisha Samanta, Connie Tao, David I. Adelani, Vera Axelrod, Isaac Caswell, Colin Cherry, Dan Garrette, Reeve Ingle, Melvin Johnson, Dmitry Panteleev, Partha Talukdar
cs.AI
Abstract
Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks -- tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages, where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies, including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios, including text-only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.