

DiscoX: Benchmarking Discourse-Level Translation Task in Expert Domains

November 14, 2025
Authors: Xiying Zhao, Zhoufutu Wen, Zhixuan Chen, Jingzhe Ding, Jianpeng Jiao, Shuai Li, Xi Li, Danni Liang, Shengda Long, Qianqian Liu, Xianbo Wu, Hongwan Gao, Xiang Gao, Liang Hu, Jiashuo Liu, Mengyun Liu, Weiran Shi, Chenghao Yang, Qianyu Yang, Xuanliang Zhang, Ge Zhang, Wenhao Huang
cs.AI

Abstract

The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While these translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level, expert-level Chinese-English translation. It comprises 200 professionally curated texts from 7 domains, with an average length exceeding 1,700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments across accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our experiments reveal a remarkable performance gap: even the most advanced LLMs still trail human experts on these tasks. This finding validates the difficulty of DiscoX and underscores the challenges that remain in achieving professional-grade machine translation. The proposed benchmark and evaluation system provide a robust framework for more rigorous evaluation, facilitating future advancements in LLM-based translation.
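To make the reference-free, multi-dimension scoring idea concrete, the following is a minimal sketch of how such an evaluator could be structured. The three dimension names (accuracy, fluency, appropriateness) come from the abstract; the weights, the `judge_translation` heuristic, and all function names are hypothetical illustrations, not the paper's actual Metric-S implementation (which would use a learned or LLM-based judge rather than the placeholder heuristic shown here).

```python
from dataclasses import dataclass

@dataclass
class DimensionScores:
    """Per-dimension quality scores on a 0-100 scale (dimension names from the abstract)."""
    accuracy: float         # terminological / semantic fidelity
    fluency: float          # target-language naturalness
    appropriateness: float  # register and discourse-level suitability

def judge_translation(source: str, translation: str) -> DimensionScores:
    """Placeholder judge: in a real system this would be an LLM or learned
    model scoring the translation without a reference. Here, a trivial
    length-ratio heuristic stands in so the sketch is runnable."""
    ratio = min(len(source), len(translation)) / max(len(source), len(translation), 1)
    base = 100.0 * ratio
    return DimensionScores(accuracy=base, fluency=base, appropriateness=base)

def overall_score(scores: DimensionScores,
                  weights: tuple[float, float, float] = (0.5, 0.25, 0.25)) -> float:
    """Weighted aggregate of the three dimensions (weights are assumed, not from the paper)."""
    wa, wf, wp = weights
    return wa * scores.accuracy + wf * scores.fluency + wp * scores.appropriateness
```

The design point the sketch illustrates is that a reference-free metric scores each dimension independently and only then aggregates, so a document can be flagged as fluent yet terminologically wrong, which segment-level metrics with a single score tend to conflate.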