Dynaword: From One-shot to Continuously Developed Datasets
August 4, 2025
Authors: Kenneth Enevoldsen, Kristian Nørgaard Jensen, Jan Kostkan, Balázs Szabó, Márton Kardos, Kirten Vad, Andrea Blasi Núñez, Gianluca Barmina, Jacob Nielsen, Rasmus Larsen, Peter Vahlstrup, Per Møldrup Dalum, Desmond Elliott, Lukas Galke, Peter Schneider-Kamp, Kristoffer Nielbo
cs.AI
Abstract
Large-scale datasets are foundational for research and development in natural
language processing. However, current approaches face three key challenges: (1)
reliance on ambiguously licensed sources restricting use, sharing, and
derivative works; (2) static dataset releases that prevent community
contributions and diminish longevity; and (3) quality assurance processes
restricted to publishing teams rather than leveraging community expertise.
To address these limitations, we introduce two contributions: the Dynaword
approach and Danish Dynaword. The Dynaword approach is a framework for creating
large-scale, open datasets that can be continuously updated through community
collaboration. Danish Dynaword is a concrete implementation that validates this
approach and demonstrates its potential. Danish Dynaword contains over four
times as many tokens as comparable releases, is exclusively openly licensed,
and has received multiple contributions from both industry and research. The
repository includes lightweight tests to ensure data formatting, quality, and
documentation, establishing a sustainable framework for ongoing community
contributions and dataset evolution.