Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak

Abstract

This study presents several contributions for the Karakalpak language: a translation of the FLORES+ devtest dataset into Karakalpak; parallel corpora of 100,000 pairs each for Uzbek-Karakalpak, Russian-Karakalpak, and English-Karakalpak; and open-source fine-tuned neural models for translation across these languages. Our experiments compare different model variants and training approaches and show improvements over existing baselines. This work, conducted as part of the Open Language Data Initiative (OLDI) shared task, aims to advance machine translation for Karakalpak and to help expand linguistic diversity in NLP technologies.