ROSE: An Intent-Centered Evaluation Metric for NL2SQL

Abstract

Execution Accuracy (EX), the widely used metric for evaluating Natural Language to SQL (NL2SQL) solutions, is becoming increasingly unreliable. It is sensitive to syntactic variation, ignores that a question may admit multiple interpretations, and is easily misled by erroneous ground-truth SQL. To address this, we introduce ROSE, an intent-centered metric that asks whether the predicted SQL answers the question, rather than whether it is consistent with the ground-truth SQL as in the reference-dependent paradigm. ROSE employs an adversarial Prover-Refuter cascade: the SQL Prover independently assesses the semantic correctness of a predicted SQL against the user's intent, while the Adversarial Refuter uses the ground-truth SQL as evidence to challenge and refine this judgment. On our expert-aligned validation set, ROSE-VEC, ROSE achieves the best agreement with human experts, outperforming the next-best metric by nearly 24% in Cohen's Kappa. We also conduct a large-scale re-evaluation of 19 NL2SQL methods, revealing four valuable insights. We release ROSE and ROSE-VEC to facilitate more reliable NL2SQL research.
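The agreement with human experts is reported in Cohen's Kappa, which corrects raw agreement for the agreement two raters would reach by chance. As a reminder of what that statistic measures, here is a minimal Python sketch. The function name `cohens_kappa` and the toy labels are illustrative assumptions; they are not taken from ROSE-VEC or from the ROSE release.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, computed from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up toy data: 1 = "predicted SQL judged correct", 0 = "judged incorrect".
expert = [1, 1, 0, 1, 0, 0, 1, 1]  # hypothetical expert labels
metric = [1, 1, 0, 0, 0, 1, 1, 1]  # hypothetical verdicts from some metric
kappa = cohens_kappa(expert, metric)
```

In this toy case the two raters agree on 6 of 8 items (0.75 raw agreement), but because both label most items as correct, a sizeable share of that agreement is expected by chance, so the kappa value is noticeably lower than 0.75.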
