ROSE: An Intent-Centered Evaluation Metric for NL2SQL

Abstract

Execution Accuracy (EX), the widely used metric for evaluating Natural Language to SQL (NL2SQL) solutions, is becoming increasingly unreliable. It is sensitive to syntactic variation, ignores the fact that a question may admit multiple interpretations, and is easily misled by erroneous ground-truth SQL. To address this, we introduce ROSE, an intent-centered metric that asks whether the predicted SQL answers the question, rather than whether it is consistent with the ground-truth SQL under the reference-dependent paradigm. ROSE employs an adversarial Prover-Refuter cascade: the SQL Prover independently assesses the semantic correctness of a predicted SQL against the user's intent, while the Adversarial Refuter uses the ground-truth SQL as evidence to challenge and refine this judgment. On our expert-aligned validation set ROSE-VEC, ROSE achieves the best agreement with human experts, outperforming the next-best metric by nearly 24% in Cohen's Kappa. We also conduct a large-scale re-evaluation of 19 NL2SQL methods, revealing four valuable insights. We release ROSE and ROSE-VEC to facilitate more reliable NL2SQL research.
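For readers unfamiliar with the agreement statistic mentioned above, the following is a minimal sketch of how Cohen's Kappa can be computed between a metric's verdicts and human expert labels. It uses the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name `cohens_kappa` and the sample labels are illustrative assumptions, not part of ROSE or ROSE-VEC.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over categories of the product of marginal rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    # Degenerate case: both raters always use the same single category.
    if p_e == 1.0:
        return 1.0

    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical verdicts (1 = predicted SQL judged correct, 0 = incorrect).
human_labels = [1, 1, 1, 0, 0, 0, 1, 0]
metric_labels = [1, 1, 0, 0, 0, 1, 1, 0]

print(cohens_kappa(human_labels, metric_labels))  # 0.5
```

In this made-up example the two raters agree on 6 of 8 items (p_o = 0.75) while chance agreement is 0.5, so kappa is 0.5. Kappa is preferred over raw agreement here because it discounts agreement that would occur by chance when the label distribution is skewed.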
