ChatPaper.ai

Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis

December 26, 2025
Author: Duygu Altinok
cs.AI

Abstract

Evaluating the performance of various model architectures, such as transformers, large language models (LLMs), and other NLP systems, requires comprehensive benchmarks that measure performance across multiple dimensions. Among these, the evaluation of natural language understanding (NLU) is particularly critical as it serves as a fundamental criterion for assessing model capabilities. Thus, it is essential to establish benchmarks that enable thorough evaluation and analysis of NLU abilities from diverse perspectives. While the GLUE benchmark has set a standard for evaluating English NLU, similar benchmarks have been developed for other languages, such as CLUE for Chinese, FLUE for French, and JGLUE for Japanese. However, no comparable benchmark currently exists for the Turkish language. To address this gap, we introduce TrGLUE, a comprehensive benchmark encompassing a variety of NLU tasks for Turkish. In addition, we present SentiTurca, a specialized benchmark for sentiment analysis. To support researchers, we also provide fine-tuning and evaluation code for transformer-based models, facilitating the effective use of these benchmarks. TrGLUE comprises Turkish-native corpora curated to mirror the domains and task formulations of GLUE-style evaluations, with labels obtained through a semi-automated pipeline that combines strong LLM-based annotation, cross-model agreement checks, and subsequent human validation. This design prioritizes linguistic naturalness, minimizes direct translation artifacts, and yields a scalable, reproducible workflow. With TrGLUE, our goal is to establish a robust evaluation framework for Turkish NLU, empower researchers with valuable resources, and provide insights into generating high-quality semi-automated datasets.
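The labeling pipeline described above combines LLM-based annotation with cross-model agreement checks before routing uncertain examples to human validators. The paper's own code is not reproduced here; the following is a minimal sketch of how the agreement step could work, where the function name, vote format, and threshold are illustrative assumptions rather than the authors' actual implementation.

```python
from collections import Counter

def aggregate_labels(model_votes, min_agreement=2):
    """Accept a label automatically only when enough LLM annotators agree;
    otherwise flag the example for human validation.

    model_votes: dict mapping annotator name -> predicted label (assumed format).
    Returns (majority_label, needs_human_review).
    """
    counts = Counter(model_votes.values())
    label, votes = counts.most_common(1)[0]
    if votes >= min_agreement:
        return label, False   # consensus reached: keep the label as-is
    return label, True        # disagreement: route to a human annotator

# Example: three hypothetical LLM annotators label one premise/hypothesis pair
votes = {"llm_a": "entailment", "llm_b": "entailment", "llm_c": "neutral"}
label, needs_review = aggregate_labels(votes)
# Two of three annotators agree, so the label is accepted without review.
```

A real pipeline would also track per-model confidence and sample a fraction of the auto-accepted labels for spot-checking, but the majority-vote-plus-escalation structure above captures the semi-automated workflow the abstract describes.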