SwiLTra-Bench: The Swiss Legal Translation Benchmark

Abstract

In Switzerland, legal translation is uniquely important because of the country's four official languages and its requirements for multilingual legal documentation. However, this process traditionally relies on professionals who must be both legal experts and skilled translators, which creates bottlenecks and impedes effective access to justice. To address this challenge, we introduce SwiLTra-Bench, a comprehensive multilingual benchmark of over 180K aligned Swiss legal translation pairs comprising laws, headnotes, and press releases across all Swiss languages along with English, designed to evaluate LLM-based translation systems. Our systematic evaluation reveals that frontier models achieve superior translation performance across all document types, while specialized translation systems excel specifically in laws but underperform in headnotes. Through rigorous testing and human expert validation, we demonstrate that while fine-tuning open SLMs significantly improves their translation quality, they still lag behind the best zero-shot prompted frontier models such as Claude-3.5-Sonnet. Additionally, we present SwiLTra-Judge, a specialized LLM evaluation system that aligns best with human expert assessments.