Legal RAG Bench: an end-to-end benchmark for legal RAG

Abstract

We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal retrieval-augmented generation (RAG) systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions that demand expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench uses a full factorial design and a novel hierarchical error decomposition framework, enabling like-for-like comparisons of the contributions of retrieval and reasoning models within RAG systems. We evaluate three state-of-the-art embedding models (Isaacus' Kanon 2 Embedder, Google's Gemini Embedding 001, and OpenAI's Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal RAG performance, with LLMs exerting a more moderate effect on correctness and groundedness. Kanon 2 Embedder, in particular, had the largest positive impact on performance, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. We observe that many errors attributed to hallucinations in legal RAG systems are in fact triggered by retrieval failures, and conclude that retrieval sets the ceiling on the performance of many modern legal RAG systems. We document why and how we built Legal RAG Bench alongside the results of our evaluations. We also openly release our code and data to assist with reproduction of our findings.
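
As a rough illustration of what a full factorial design means here, the sketch below enumerates every pairing of the three embedding models with the two LLMs. The helper names (`full_factorial`, `main_effect`) are hypothetical and are not taken from the released code. The definition of a main effect used here (mean of the cells that use a given model minus the mean of all other cells) is one common convention and is only an assumption about how average differences such as those quoted above might be computed.

```python
from itertools import product
from statistics import mean

# Factor levels named in the abstract.
EMBEDDERS = ["Kanon 2 Embedder", "Gemini Embedding 001", "Text Embedding 3 Large"]
LLMS = ["Gemini 3.1 Pro", "GPT-5.2"]


def full_factorial(embedders, llms):
    """Return every (embedder, LLM) pairing, so each level of one
    factor is evaluated with each level of the other."""
    return list(product(embedders, llms))


def main_effect(scores, factor_index, level):
    """Mean score of cells that use `level` minus mean score of all other cells.

    `scores` maps (embedder, llm) -> metric value.
    `factor_index` is 0 for the embedder factor and 1 for the LLM factor.
    This is an assumed convention, not necessarily the paper's definition.
    """
    with_level = [s for cfg, s in scores.items() if cfg[factor_index] == level]
    without_level = [s for cfg, s in scores.items() if cfg[factor_index] != level]
    return mean(with_level) - mean(without_level)


# 3 embedders x 2 LLMs = 6 system configurations to evaluate.
CONFIGS = full_factorial(EMBEDDERS, LLMS)
```

Because every embedder is paired with every LLM, a difference attributed to one factor is averaged over all levels of the other, which is what makes the comparison like-for-like.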
