Dynaword: From One-shot to Continuously Developed Datasets

Abstract

Large-scale datasets are foundational for research and development in natural language processing. However, current approaches face three key challenges: (1) reliance on ambiguously licensed sources, which restricts use, sharing, and derivative works; (2) static dataset releases that prevent community contributions and diminish longevity; and (3) quality assurance processes restricted to publishing teams rather than leveraging community expertise. To address these limitations, we introduce two contributions: the Dynaword approach and Danish Dynaword. The Dynaword approach is a framework for creating large-scale, open datasets that can be continuously updated through community collaboration. Danish Dynaword is a concrete implementation that validates this approach and demonstrates its potential. Danish Dynaword contains over four times as many tokens as comparable releases, is exclusively openly licensed, and has received multiple contributions from both industry and research. The repository includes lightweight tests to ensure data formatting, quality, and documentation, establishing a sustainable framework for ongoing community contributions and dataset evolution.
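
To make the idea of lightweight repository tests concrete, the following is a minimal sketch of the kind of check that could run on every contribution. It is not the actual Danish Dynaword test suite: the field names (`id`, `text`, `source`, `license`), the license allow-list, and the function `validate_records` are all assumptions introduced here for illustration.

```python
from collections import Counter

# Hypothetical schema and license allow-list, chosen for illustration only.
REQUIRED_FIELDS = ("id", "text", "source", "license")
OPEN_LICENSES = {"cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0"}


def validate_records(records):
    """Return a list of human-readable problems found in `records`."""
    problems = []
    # Count ids up front so duplicates can be reported on every offending record.
    id_counts = Counter(record.get("id") for record in records)
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems.append(f"record {index}: missing fields {missing}")
            continue  # the remaining checks need all fields to be present
        if not isinstance(record["text"], str) or not record["text"].strip():
            problems.append(f"record {index}: empty text")
        if str(record["license"]).lower() not in OPEN_LICENSES:
            problems.append(
                f"record {index}: license {record['license']!r} not in allow-list"
            )
        if id_counts[record["id"]] > 1:
            problems.append(f"record {index}: duplicate id {record['id']!r}")
    return problems
```

A check of this shape could be wrapped in a test that asserts `validate_records(...) == []` for each contributed subset, so that formatting and licensing problems surface before a contribution is merged rather than after release.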