YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus

July 15, 2024
Authors: Garrett Tanzer, Biao Zhang
cs.AI

Abstract

Even for better-studied sign languages like American Sign Language (ASL), data is the bottleneck for machine learning research. The situation is worse yet for the many other sign languages used by Deaf/Hard of Hearing communities around the world. In this paper, we present YouTube-SL-25, a large-scale, open-domain multilingual corpus of sign language videos with seemingly well-aligned captions drawn from YouTube. With >3000 hours of videos across >25 sign languages, YouTube-SL-25 is a) >3x the size of YouTube-ASL, b) the largest parallel sign language dataset to date, and c) the first or largest parallel dataset for many of its component languages. We provide baselines for sign-to-text tasks using a unified multilingual multitask model based on T5 and report scores on benchmarks across 4 sign languages. The results demonstrate that multilingual transfer benefits both higher- and lower-resource sign languages within YouTube-SL-25.
