ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation

July 29, 2024
Authors: Mohammed Khalil, Mohammed Sabry
cs.AI

Abstract

Classical Arabic represents a significant era, encompassing the golden age of Arab culture, philosophy, and scientific literature. With a broad consensus on the importance of translating these literatures to enrich knowledge dissemination across communities, the advent of large language models (LLMs) and translation systems offers promising tools to facilitate this goal. However, we have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics, hindering the development of high-quality translation systems. In response, we present the ATHAR dataset, comprising 66,000 high-quality Classical Arabic to English translation samples that cover a wide array of subjects including science, culture, and philosophy. Furthermore, we assess the performance of current state-of-the-art LLMs under various settings, concluding that there is a need for such datasets in current systems. Our findings highlight how models can benefit from fine-tuning or incorporating this dataset into their pretraining pipelines. The dataset is publicly available on the HuggingFace Data Hub at https://huggingface.co/datasets/mohamed-khalil/ATHAR.
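
Since the abstract points to a public copy of the dataset on the HuggingFace Data Hub, a minimal sketch of fetching it may be useful. It assumes the `datasets` Python library is installed and that the repository loads with its default configuration; neither is stated in the abstract. The helper names `repo_id_from_url` and `load_athar` are hypothetical, and no column or split names are assumed.

```python
from urllib.parse import urlparse

# URL quoted from the abstract.
ATHAR_URL = "https://huggingface.co/datasets/mohamed-khalil/ATHAR"


def repo_id_from_url(url: str) -> str:
    """Extract the '<owner>/<name>' repository id from a Hub dataset URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hub dataset URL: {url}")
    return "/".join(parts[1:])


def load_athar():
    """Download the dataset (hypothetical helper).

    Assumes network access and `pip install datasets`, and that the
    repository loads without an explicit configuration name.
    """
    # Imported lazily so repo_id_from_url works without the library.
    from datasets import load_dataset

    return load_dataset(repo_id_from_url(ATHAR_URL))
```

Calling `load_athar()` and printing the result should list the available splits and column names, which is a reasonable first step before fine-tuning or evaluating a translation model on the data.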
