Turk-LettuceDetect: Hallucination Detection Models for Turkish RAG Applications

Abstract

The widespread adoption of Large Language Models (LLMs) has been hindered by their tendency to hallucinate, generating plausible but factually incorrect information. While Retrieval-Augmented Generation (RAG) systems attempt to address this issue by grounding responses in external knowledge, hallucination remains a persistent challenge, particularly for morphologically complex, low-resource languages like Turkish. This paper introduces Turk-LettuceDetect, the first suite of hallucination detection models specifically designed for Turkish RAG applications. Building on the LettuceDetect framework, we formulate hallucination detection as a token-level classification task and fine-tune three distinct encoder architectures: a Turkish-specific ModernBERT, TurkEmbed4STS, and multilingual EuroBERT. These models were trained on a machine-translated version of the RAGTruth benchmark dataset containing 17,790 instances across question answering, data-to-text generation, and summarization tasks. Our experimental results show that the ModernBERT-based model achieves an F1-score of 0.7266 on the complete test set, with particularly strong performance on structured tasks. The models maintain computational efficiency while supporting long contexts up to 8,192 tokens, making them suitable for real-time deployment. Comparative analysis reveals that while state-of-the-art LLMs demonstrate high recall, they suffer from low precision due to over-generation of hallucinated content, underscoring the necessity of specialized detection mechanisms. By releasing our models and translated dataset, this work addresses a critical gap in multilingual NLP and establishes a foundation for developing more reliable and trustworthy AI applications for Turkish and other languages.
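
To make the token-level formulation concrete, here is a minimal sketch of how per-token hallucination scores could be turned into flagged text spans in an answer. It is illustrative only and is not the released implementation. The whitespace tokenization, the 0.5 threshold, the helper names (`whitespace_offsets`, `tokens_to_spans`), and the probability values are all assumptions made for this example; a real encoder would use subword tokens and its own classifier outputs.

```python
from typing import List, Tuple


def whitespace_offsets(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for whitespace-separated tokens.

    Stand-in for a real subword tokenizer's offset mapping (assumption).
    """
    offsets: List[Tuple[int, int]] = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((start, end))
        pos = end
    return offsets


def tokens_to_spans(
    offsets: List[Tuple[int, int]],
    probs: List[float],
    threshold: float = 0.5,  # hypothetical decision threshold
) -> List[Tuple[int, int]]:
    """Merge consecutive tokens flagged as hallucinated into character spans."""
    spans: List[Tuple[int, int]] = []
    prev_flagged = False
    for (start, end), p in zip(offsets, probs):
        flagged = p >= threshold
        if flagged and prev_flagged:
            # Extend the current span so it also covers this token.
            spans[-1] = (spans[-1][0], end)
        elif flagged:
            # Start a new span at this token.
            spans.append((start, end))
        prev_flagged = flagged
    return spans


# Hypothetical Turkish answer and made-up per-token hallucination probabilities.
answer = "Ankara 1923'te başkent oldu ve nüfusu 20 milyondur."
probs = [0.02, 0.05, 0.03, 0.04, 0.10, 0.35, 0.91, 0.88]

offsets = whitespace_offsets(answer)
spans = tokens_to_spans(offsets, probs)
flagged_text = [answer[s:e] for s, e in spans]
print(flagged_text)
```

Merging adjacent flagged tokens keeps the output readable for long contexts: a downstream application can highlight a handful of spans rather than many individual tokens.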
