

Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

February 25, 2026
Authors: Hanna Yukhymenko, Anton Alexandrov, Martin Vechev
cs.AI

Abstract

The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek). Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.
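The abstract describes a multi-round ranking method (T-RANK) for selecting the best translation from several candidates at test time. The paper's actual algorithm is not detailed here, so the following is only a minimal sketch of the general idea: generate several candidate translations, then run pairwise elimination rounds with a judge until one candidate remains. The `judge` stub (which here simply prefers the longer string as a stand-in for an LLM-as-a-judge call) and all function names are hypothetical illustrations, not the authors' implementation.

```python
def judge(candidate_a: str, candidate_b: str) -> str:
    """Hypothetical judge: returns the preferred of two candidate translations.
    Stubbed with string length as a placeholder quality signal; a real
    pipeline would call an LLM-as-a-judge here."""
    return candidate_a if len(candidate_a) >= len(candidate_b) else candidate_b

def elimination_round(candidates: list[str]) -> list[str]:
    """One round: pair up candidates and keep each pair's winner."""
    winners = [judge(candidates[i], candidates[i + 1])
               for i in range(0, len(candidates) - 1, 2)]
    if len(candidates) % 2 == 1:  # an unpaired candidate advances automatically
        winners.append(candidates[-1])
    return winners

def select_translation(candidates: list[str]) -> str:
    """Run elimination rounds until a single translation remains."""
    while len(candidates) > 1:
        candidates = elimination_round(candidates)
    return candidates[0]

# Four hypothetical candidate translations of one source sentence.
candidates = ["draft one", "a longer draft two", "the longest draft three here", "d4"]
best = select_translation(candidates)
```

With the length-based stub, `best` is simply the longest candidate; the point of the sketch is the tournament structure, which keeps the number of judge calls linear in the number of candidates rather than quadratic as in full pairwise comparison.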
March 7, 2026