Turk-LettuceDetect: Hallucination Detection Models for Turkish RAG Applications
September 22, 2025
Authors: Selva Taş, Mahmut El Huseyni, Özay Ezerceli, Reyhan Bayraktar, Fatma Betül Terzioğlu
cs.AI
Abstract
The widespread adoption of Large Language Models (LLMs) has been hindered by
their tendency to hallucinate, generating plausible but factually incorrect
information. While Retrieval-Augmented Generation (RAG) systems attempt to
address this issue by grounding responses in external knowledge, hallucination
remains a persistent challenge, particularly for morphologically complex,
low-resource languages like Turkish. This paper introduces Turk-LettuceDetect,
the first suite of hallucination detection models specifically designed for
Turkish RAG applications. Building on the LettuceDetect framework, we formulate
hallucination detection as a token-level classification task and fine-tune
three distinct encoder architectures: a Turkish-specific ModernBERT,
TurkEmbed4STS, and multilingual EuroBERT. These models were trained on a
machine-translated version of the RAGTruth benchmark dataset containing 17,790
instances across question answering, data-to-text generation, and summarization
tasks. Our experimental results show that the ModernBERT-based model achieves
an F1-score of 0.7266 on the complete test set, with particularly strong
performance on structured tasks. The models maintain computational efficiency
while supporting long contexts up to 8,192 tokens, making them suitable for
real-time deployment. Comparative analysis reveals that while state-of-the-art
LLMs demonstrate high recall, they suffer from low precision due to
over-generation of hallucinated content, underscoring the necessity of
specialized detection mechanisms. By releasing our models and translated
dataset, this work addresses a critical gap in multilingual NLP and establishes
a foundation for developing more reliable and trustworthy AI applications for
Turkish and other languages.
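
The abstract formulates hallucination detection as token-level classification over the generated answer. The sketch below is a minimal, hedged illustration of how such a detector could be applied at inference time with the Hugging Face transformers API; the checkpoint identifier, the prompt layout, and the label convention (0 = supported, 1 = hallucinated) are assumptions for illustration, not the authors' released interface.

```python
# Minimal sketch: token-level hallucination detection as token classification.
# The checkpoint name is a placeholder; substitute the released Turk-LettuceDetect weights.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "org/turk-lettucedetect-modernbert"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

context = "..."   # retrieved Turkish passages
question = "..."  # user question
answer = "..."    # LLM-generated answer to be checked

# Encode the grounding text and the answer as a pair so the model can label
# each answer token as supported (0) or hallucinated (1).
enc = tokenizer(
    context + "\n" + question,
    answer,
    return_tensors="pt",
    truncation=True,
    max_length=8192,  # long-context limit reported in the abstract
)

with torch.no_grad():
    logits = model(**enc).logits        # shape: (1, seq_len, 2)
pred = logits.argmax(dim=-1)[0]         # per-token label predictions

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
flagged = [t for t, p in zip(tokens, pred.tolist()) if p == 1]
print("Tokens flagged as hallucinated:", flagged)
```

Because the encoder scores every token in a single forward pass, this kind of detector avoids the per-claim prompting loops that LLM-as-judge approaches require, which is what makes the reported real-time deployment plausible.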