Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

Abstract

The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework that addresses these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, yields significantly higher-quality outputs than traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek). Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.
