No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data
February 4, 2026
Author: Dmitry Karpov
cs.AI
Abstract
We explore machine translation for five Turkic language pairs: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, and English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bashkir. Prompting DeepSeek-V3.2 with retrieved similar examples achieved chrF++ 39.47 for Chuvash. For Tatar, zero-shot or retrieval-based approaches achieved chrF++ 41.6, while for Kyrgyz the zero-shot approach reached 45.6. We release the dataset and the trained weights.
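To make the retrieval-based prompting idea concrete, the following is a minimal sketch of how similar examples might be selected and placed into a few-shot prompt. The abstract does not say how similarity is measured or how the prompt is worded, so the character-trigram Jaccard similarity, the prompt template, and the names `retrieve_similar`, `build_prompt`, and `POOL` are all assumptions made for illustration. The target sides of the example pairs are placeholders, not real Chuvash text.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of lowercase character n-grams of a sentence."""
    s = " ".join(text.lower().split())  # normalise case and whitespace
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_similar(query: str, pool: List[Pair], k: int = 2) -> List[Pair]:
    """Return the k pairs whose source side is most similar to the query.

    Assumption: similarity is character-trigram Jaccard overlap. The paper
    may use a different retriever (for example, embeddings or BM25).
    """
    q = char_ngrams(query)
    ranked = sorted(pool, key=lambda p: jaccard(q, char_ngrams(p[0])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: List[Pair],
                 src_lang: str = "English", tgt_lang: str = "Chuvash") -> str:
    """Format retrieved pairs as few-shot examples followed by the query.

    The template is hypothetical; it is not taken from the paper.
    """
    lines = [f"Translate from {src_lang} to {tgt_lang}."]
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {query}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


# Hypothetical parallel pool; target sides are placeholders, not real data.
POOL: List[Pair] = [
    ("The weather is nice today.", "<Chuvash translation A>"),
    ("Where is the train station?", "<Chuvash translation B>"),
    ("The weather was bad yesterday.", "<Chuvash translation C>"),
]

if __name__ == "__main__":
    query = "How is the weather today?"
    shots = retrieve_similar(query, POOL, k=2)
    print(build_prompt(query, shots))
```

The resulting string would then be sent to the language model as the prompt. With `k=0` the same template reduces to a zero-shot prompt, which corresponds to the setting the abstract reports for Kyrgyz.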