Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
February 25, 2026
Authors: Hanna Yukhymenko, Anton Alexandrov, Martin Vechev
cs.AI
Abstract
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, yields significantly higher-quality outputs than traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek). Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.
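The abstract does not detail how the multi-round ranking in T-RANK operates, but the general idea of selecting the best of several candidate translations via repeated pairwise comparisons can be sketched as a simple single-elimination tournament. Everything below is an illustrative assumption, not the paper's actual algorithm: the names `multi_round_rank` and `judge` are hypothetical, and a real system would back `judge` with an LLM-as-a-judge call rather than the toy heuristic used here so the sketch stays runnable.

```python
from typing import Callable, List

def multi_round_rank(candidates: List[str],
                     judge: Callable[[str, str], str]) -> str:
    """Single-elimination tournament over candidate translations:
    each round pairs up the remaining candidates and keeps the one
    the judge prefers, until a single winner remains."""
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            next_round.append(judge(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd candidate gets a bye
        pool = next_round
    return pool[0]

# Toy stand-in judge: prefers the longer string. A real pipeline would
# instead prompt an LLM to compare the two translations for fidelity.
toy_judge = lambda a, b: a if len(a) >= len(b) else b

best = multi_round_rank(["draft A", "a much more detailed draft", "B"],
                        toy_judge)
```

With N candidates this makes roughly N-1 judge calls; running several such rounds with different pairings (or aggregating wins) is one plausible reading of "multi-round" ranking.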