Dynaword: From One-shot to Continuously Developed Datasets
August 4, 2025
Authors: Kenneth Enevoldsen, Kristian Nørgaard Jensen, Jan Kostkan, Balázs Szabó, Márton Kardos, Kirten Vad, Andrea Blasi Núñez, Gianluca Barmina, Jacob Nielsen, Rasmus Larsen, Peter Vahlstrup, Per Møldrup Dalum, Desmond Elliott, Lukas Galke, Peter Schneider-Kamp, Kristoffer Nielbo
cs.AI
Abstract
Large-scale datasets are foundational for research and development in natural
language processing. However, current approaches face three key challenges: (1)
reliance on ambiguously licensed sources restricting use, sharing, and
derivative works; (2) static dataset releases that prevent community
contributions and diminish longevity; and (3) quality assurance processes
restricted to publishing teams rather than leveraging community expertise.
To address these limitations, we introduce two contributions: the Dynaword
approach and Danish Dynaword. The Dynaword approach is a framework for creating
large-scale, open datasets that can be continuously updated through community
collaboration. Danish Dynaword is a concrete implementation that validates this
approach and demonstrates its potential. Danish Dynaword contains over four
times as many tokens as comparable releases, is exclusively openly licensed,
and has received multiple contributions across industry and research. The
repository includes light-weight tests to ensure data formatting, quality, and
documentation, establishing a sustainable framework for ongoing community
contributions and dataset evolution.