
Legal RAG Bench: an end-to-end benchmark for legal RAG

March 2, 2026
作者: Abdur-Rahman Butler, Umar Butler
cs.AI

Abstract

We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench leverages a full factorial design and novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus' Kanon 2 Embedder, Google's Gemini Embedding 001, and OpenAI's Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal RAG performance, with LLMs exerting a more moderate effect on correctness and groundedness. Kanon 2 Embedder, in particular, had the largest positive impact on performance, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. We observe that many errors attributed to hallucinations in legal RAG systems are in fact triggered by retrieval failures, concluding that retrieval sets the ceiling for the performance of many modern legal RAG systems. We document why and how we built Legal RAG Bench alongside the results of our evaluations. We also openly release our code and data to assist with reproduction of our findings.
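The abstract's two methodological ideas can be illustrated with a short sketch: a full factorial design pairs every embedding model with every LLM so their contributions can be compared like-for-like, and a hierarchical error decomposition attributes each wrong answer to retrieval before blaming generation. The function and variable names below are hypothetical illustrations, not the paper's actual code, and the decomposition is deliberately simplified.

```python
# Sketch of a full factorial evaluation grid and a simplified hierarchical
# error decomposition, in the spirit of Legal RAG Bench. Names are illustrative.
from itertools import product

embedders = ["kanon-2-embedder", "gemini-embedding-001", "text-embedding-3-large"]
llms = ["gemini-3.1-pro", "gpt-5.2"]

# Full factorial design: every retriever is crossed with every LLM,
# yielding 3 x 2 = 6 system configurations.
configs = list(product(embedders, llms))

def classify_error(retrieved_ids, gold_ids, answer_correct):
    """Hierarchical error decomposition (simplified): blame retrieval first.

    If none of the gold supporting passages were retrieved, the failure is
    attributed to retrieval even when the final answer reads like a
    hallucination; only otherwise is it counted as a generation failure.
    """
    if answer_correct:
        return "correct"
    if not set(retrieved_ids) & set(gold_ids):
        return "retrieval_failure"
    return "generation_failure"
```

This ordering is what lets the paper reclassify many apparent hallucinations as retrieval failures: an error is only charged to the LLM when the evidence it needed was actually in its context.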