XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages

Abstract

Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs), that is, languages for which NLP research is particularly far behind in meeting user needs, it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks, meaning tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages, where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies of general utility, including ASR, OCR, MT, and information access tasks. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for the other tasks. XTREME-UP provides a methodology for evaluating many modeling scenarios, including text-only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.
