

Legal RAG Bench: an end-to-end benchmark for legal RAG

March 2, 2026
作者: Abdur-Rahman Butler, Umar Butler
cs.AI

Abstract

We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench leverages a full factorial design and novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus' Kanon 2 Embedder, Google's Gemini Embedding 001, and OpenAI's Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal RAG performance, with LLMs exerting a more moderate effect on correctness and groundedness. Kanon 2 Embedder, in particular, had the largest positive impact on performance, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. We observe that many errors attributed to hallucinations in legal RAG systems are in fact triggered by retrieval failures, concluding that retrieval sets the ceiling for the performance of many modern legal RAG systems. We document why and how we built Legal RAG Bench alongside the results of our evaluations. We also openly release our code and data to assist with reproduction of our findings.
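The full factorial design mentioned above pairs every embedding model with every LLM, so that retrieval and reasoning contributions can be compared on identical question sets. A minimal sketch of enumerating that design, assuming the model lists from the abstract (the identifier strings and the idea of a config dict are illustrative, not the paper's actual code):

```python
# Sketch of the abstract's full factorial design: each embedder is
# crossed with each LLM, yielding one RAG configuration per pair.
from itertools import product

# Model lists taken from the abstract; identifier strings are illustrative.
EMBEDDERS = ["kanon-2-embedder", "gemini-embedding-001", "text-embedding-3-large"]
LLMS = ["gemini-3.1-pro", "gpt-5.2"]

def factorial_configs():
    """Enumerate every embedder x LLM system configuration."""
    return [{"embedder": e, "llm": m} for e, m in product(EMBEDDERS, LLMS)]

configs = factorial_configs()
print(len(configs))  # 3 embedders x 2 LLMs = 6 configurations
```

Running all 100 questions through each of the six configurations is what lets the hierarchical error decomposition attribute a failure to the retrieval stage or the reasoning stage.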