Turk-LettuceDetect: Hallucination Detection Models for Turkish RAG Applications
September 22, 2025
Authors: Selva Taş, Mahmut El Huseyni, Özay Ezerceli, Reyhan Bayraktar, Fatma Betül Terzioğlu
cs.AI
Abstract
The widespread adoption of Large Language Models (LLMs) has been hindered by
their tendency to hallucinate, generating plausible but factually incorrect
information. While Retrieval-Augmented Generation (RAG) systems attempt to
address this issue by grounding responses in external knowledge, hallucination
remains a persistent challenge, particularly for morphologically complex,
low-resource languages like Turkish. This paper introduces Turk-LettuceDetect,
the first suite of hallucination detection models specifically designed for
Turkish RAG applications. Building on the LettuceDetect framework, we formulate
hallucination detection as a token-level classification task and fine-tune
three distinct encoder architectures: a Turkish-specific ModernBERT,
TurkEmbed4STS, and multilingual EuroBERT. These models were trained on a
machine-translated version of the RAGTruth benchmark dataset containing 17,790
instances across question answering, data-to-text generation, and summarization
tasks. Our experimental results show that the ModernBERT-based model achieves
an F1-score of 0.7266 on the complete test set, with particularly strong
performance on structured tasks. The models maintain computational efficiency
while supporting long contexts up to 8,192 tokens, making them suitable for
real-time deployment. Comparative analysis reveals that while state-of-the-art
LLMs demonstrate high recall, they suffer from low precision due to
over-generation of hallucinated content, underscoring the necessity of
specialized detection mechanisms. By releasing our models and translated
dataset, this work addresses a critical gap in multilingual NLP and establishes
a foundation for developing more reliable and trustworthy AI applications for
Turkish and other languages.
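
The abstract frames hallucination detection as token-level classification over a (context, answer) pair. The sketch below shows how such a detector could be applied at inference time with Hugging Face Transformers; the checkpoint name and the label mapping (1 = hallucinated) are illustrative assumptions, not the authors' published artifacts.

```python
# Minimal sketch: token-level hallucination detection for a Turkish RAG answer,
# in the spirit of the LettuceDetect-style setup described above.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "your-org/turk-lettucedetect-modernbert-tr"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)
model.eval()

context = "Retrieved Turkish passage(s) that ground the answer ..."
answer = "LLM-generated Turkish answer to be verified ..."

# Encode the (context, answer) pair; long-context encoders such as ModernBERT
# or EuroBERT allow sequences of up to 8,192 tokens.
inputs = tokenizer(context, answer, return_tensors="pt",
                   truncation=True, max_length=8192)

with torch.no_grad():
    logits = model(**inputs).logits              # shape: (1, seq_len, num_labels)
preds = logits.argmax(dim=-1).squeeze(0).tolist()

# Keep only tokens that belong to the answer segment (sequence id 1) and were
# predicted as hallucinated (assumed label id 1).
seq_ids = inputs.sequence_ids(0)                 # None / 0 / 1 per position
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze(0))
hallucinated = [tok for tok, pred, sid in zip(tokens, preds, seq_ids)
                if sid == 1 and pred == 1]
print(hallucinated)
```

Token-level predictions of this kind can then be aggregated into hallucinated spans or an answer-level decision before being scored against RAGTruth-style span annotations.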