SwiLTra-Bench: The Swiss Legal Translation Benchmark

Abstract

In Switzerland, legal translation is uniquely important because of the country's four official languages and its requirements for multilingual legal documentation. However, this process has traditionally relied on professionals who must be both legal experts and skilled translators, which creates bottlenecks and impedes effective access to justice. To address this challenge, we introduce SwiLTra-Bench, a comprehensive multilingual benchmark of over 180K aligned Swiss legal translation pairs comprising laws, headnotes, and press releases across all Swiss languages along with English, designed to evaluate LLM-based translation systems. Our systematic evaluation reveals that frontier models achieve superior translation performance across all document types, while specialized translation systems excel in laws but underperform in headnotes. Through rigorous testing and human expert validation, we demonstrate that although fine-tuning open SLMs significantly improves their translation quality, they still lag behind the best zero-shot prompted frontier models such as Claude-3.5-Sonnet. Additionally, we present SwiLTra-Judge, a specialized LLM evaluation system that aligns best with human expert assessments.