Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak

September 6, 2024
Authors: Mukhammadsaid Mamasaidov, Abror Shopulatov
cs.AI

Abstract

This study presents several contributions for the Karakalpak language: a FLORES+ devtest dataset translated into Karakalpak; parallel corpora of 100,000 pairs each for Uzbek-Karakalpak, Russian-Karakalpak, and English-Karakalpak; and open-sourced fine-tuned neural models for translation across these languages. Our experiments compare different model variants and training approaches, demonstrating improvements over existing baselines. This work, conducted as part of the Open Language Data Initiative (OLDI) shared task, aims to advance machine translation capabilities for Karakalpak and contribute to expanding linguistic diversity in NLP technologies.
