YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus
July 15, 2024
Authors: Garrett Tanzer, Biao Zhang
cs.AI
Abstract
Even for better-studied sign languages like American Sign Language (ASL),
data is the bottleneck for machine learning research. The situation is worse
yet for the many other sign languages used by Deaf/Hard of Hearing communities
around the world. In this paper, we present YouTube-SL-25, a large-scale,
open-domain multilingual corpus of sign language videos with seemingly
well-aligned captions drawn from YouTube. With >3000 hours of videos across >25
sign languages, YouTube-SL-25 is a) >3x the size of YouTube-ASL, b) the largest
parallel sign language dataset to date, and c) the first or largest parallel
dataset for many of its component languages. We provide baselines for
sign-to-text tasks using a unified multilingual multitask model based on T5 and
report scores on benchmarks across 4 sign languages. The results demonstrate
that multilingual transfer benefits both higher- and lower-resource sign
languages within YouTube-SL-25.