ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation
July 29, 2024
Authors: Mohammed Khalil, Mohammed Sabry
cs.AI
Abstract
Classical Arabic represents a significant era, encompassing the golden age of
Arab culture, philosophy, and scientific literature. With a broad consensus on
the importance of translating these literatures to enrich knowledge
dissemination across communities, the advent of large language models (LLMs)
and translation systems offers promising tools to facilitate this goal.
However, we have identified a scarcity of translation datasets for Classical
Arabic, and existing ones are often limited in scope and topics, hindering the development
of high-quality translation systems. In response, we present the ATHAR dataset,
comprising 66,000 high-quality Classical Arabic to English translation samples
that cover a wide array of subjects including science, culture, and philosophy.
Furthermore, we assess the performance of current state-of-the-art LLMs under
various settings, concluding that there is a need for such datasets in current
systems. Our findings highlight how models can benefit from fine-tuning or
incorporating this dataset into their pretraining pipelines. The dataset is
publicly available on the HuggingFace Data Hub at
https://huggingface.co/datasets/mohamed-khalil/ATHAR.