ROSE: An Intent-Centered Evaluation Metric for NL2SQL

Abstract

Execution Accuracy (EX), the metric most widely used to evaluate Natural Language to SQL (NL2SQL) solutions, is becoming increasingly unreliable. It is overly sensitive to syntactic variation, ignores that a question may admit multiple interpretations, and is easily misled by erroneous ground-truth SQL. To address this, we introduce ROSE, an intent-centered metric that asks whether the predicted SQL answers the question, rather than whether it is consistent with the ground-truth SQL as in the reference-dependent paradigm. ROSE employs an adversarial Prover-Refuter cascade: the SQL Prover independently assesses the semantic correctness of a predicted SQL against the user's intent, while the Adversarial Refuter uses the ground-truth SQL as evidence to challenge and refine this judgment. On our expert-aligned validation set ROSE-VEC, ROSE achieves the best agreement with human experts, outperforming the next-best metric by nearly 24% in Cohen's Kappa. We also conduct a large-scale re-evaluation of 19 NL2SQL methods, revealing four valuable insights. We release ROSE and ROSE-VEC to facilitate more reliable NL2SQL research.
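For readers unfamiliar with the agreement statistic mentioned above, the following is a minimal sketch of the standard Cohen's Kappa computation between two sets of correct/incorrect verdicts, for example those of a human expert and those of an automatic metric. The function name `cohens_kappa` and the toy labels are illustrative assumptions, not taken from ROSE or ROSE-VEC.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa between two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters always use the same single label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts (1 = predicted SQL judged correct, 0 = judged incorrect).
human_verdicts = [1, 1, 0, 0, 1, 0, 1, 1]
metric_verdicts = [1, 1, 0, 1, 1, 0, 0, 1]
kappa = cohens_kappa(human_verdicts, metric_verdicts)
```

Unlike raw agreement, Kappa discounts the agreement two raters would reach by chance, which matters when most predictions fall into a single class.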
