

Turk-LettuceDetect: Hallucination Detection Models for Turkish RAG Applications

September 22, 2025
作者: Selva Taş, Mahmut El Huseyni, Özay Ezerceli, Reyhan Bayraktar, Fatma Betül Terzioğlu
cs.AI

Abstract

The widespread adoption of Large Language Models (LLMs) has been hindered by their tendency to hallucinate, generating plausible but factually incorrect information. While Retrieval-Augmented Generation (RAG) systems attempt to address this issue by grounding responses in external knowledge, hallucination remains a persistent challenge, particularly for morphologically complex, low-resource languages like Turkish. This paper introduces Turk-LettuceDetect, the first suite of hallucination detection models specifically designed for Turkish RAG applications. Building on the LettuceDetect framework, we formulate hallucination detection as a token-level classification task and fine-tune three distinct encoder architectures: a Turkish-specific ModernBERT, TurkEmbed4STS, and multilingual EuroBERT. These models were trained on a machine-translated version of the RAGTruth benchmark dataset containing 17,790 instances across question answering, data-to-text generation, and summarization tasks. Our experimental results show that the ModernBERT-based model achieves an F1-score of 0.7266 on the complete test set, with particularly strong performance on structured tasks. The models maintain computational efficiency while supporting long contexts up to 8,192 tokens, making them suitable for real-time deployment. Comparative analysis reveals that while state-of-the-art LLMs demonstrate high recall, they suffer from low precision due to over-generation of hallucinated content, underscoring the necessity of specialized detection mechanisms. By releasing our models and translated dataset, this work addresses a critical gap in multilingual NLP and establishes a foundation for developing more reliable and trustworthy AI applications for Turkish and other languages.
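The abstract frames hallucination detection as a token-level classification task over the generated answer, conditioned on the retrieved context, in the style of the LettuceDetect framework. The sketch below illustrates how such a detector might be run with Hugging Face Transformers; it is not the authors' released code, and the checkpoint name, label convention (0 = supported, 1 = hallucinated), and the `detect_hallucinated_spans` helper are illustrative assumptions.

```python
# Minimal sketch of token-level hallucination detection (LettuceDetect-style),
# assuming a ModernBERT-family encoder fine-tuned for 2-way token classification.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder checkpoint; the paper's Turkish-specific fine-tuned model is assumed here.
MODEL_NAME = "answerdotai/ModernBERT-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()


def detect_hallucinated_spans(context: str, question: str, answer: str):
    """Return answer tokens predicted as hallucinated (label 1)."""
    # Encode the grounding context + question as the first segment and the
    # generated answer as the second, within the 8,192-token context window.
    enc = tokenizer(
        context + "\n" + question,
        answer,
        return_tensors="pt",
        truncation=True,
        max_length=8192,
    )
    with torch.no_grad():
        logits = model(**enc).logits  # shape: (1, seq_len, 2)
    preds = logits.argmax(dim=-1)[0]

    flagged = []
    # Score only tokens that belong to the answer (the second sequence).
    for idx, seq_id in enumerate(enc.sequence_ids(0)):
        if seq_id == 1 and preds[idx].item() == 1:
            flagged.append(tokenizer.convert_ids_to_tokens(enc["input_ids"][0][idx].item()))
    return flagged
```

Because the detector only runs an encoder forward pass over the context and answer, this setup stays lightweight enough for the real-time deployment scenario the abstract describes.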