Dynaword: From One-shot to Continuously Developed Datasets

Abstract

Large-scale datasets are foundational for research and development in natural language processing. However, current approaches face three key challenges: (1) reliance on ambiguously licensed sources, which restricts use, sharing, and derivative works; (2) static dataset releases, which prevent community contributions and diminish longevity; and (3) quality assurance processes restricted to the publishing team rather than drawing on community expertise. To address these limitations, we introduce two contributions: the Dynaword approach and Danish Dynaword. The Dynaword approach is a framework for creating large-scale, open datasets that can be continuously updated through community collaboration. Danish Dynaword is a concrete implementation that validates this approach and demonstrates its potential. Danish Dynaword contains over four times as many tokens as comparable releases, is exclusively openly licensed, and has received multiple contributions from both industry and research. The repository includes lightweight tests that check data formatting, quality, and documentation, establishing a sustainable framework for ongoing community contributions and dataset evolution.