SwiLTra-Bench: The Swiss Legal Translation Benchmark

Abstract

In Switzerland, legal translation is uniquely important because of the country's four official languages and its requirements for multilingual legal documentation. However, this process traditionally relies on professionals who must be both legal experts and skilled translators, which creates bottlenecks and affects effective access to justice. To address this challenge, we introduce SwiLTra-Bench, a comprehensive multilingual benchmark of over 180K aligned Swiss legal translation pairs comprising laws, headnotes, and press releases across all Swiss languages along with English, designed to evaluate LLM-based translation systems. Our systematic evaluation reveals that frontier models achieve superior translation performance across all document types, while specialized translation systems excel specifically in laws but underperform in headnotes. Through rigorous testing and human expert validation, we demonstrate that while fine-tuning open SLMs significantly improves their translation quality, the fine-tuned models still lag behind the best zero-shot prompted frontier models such as Claude-3.5-Sonnet. Additionally, we present SwiLTra-Judge, a specialized LLM evaluation system that aligns best with human expert assessments.