ROSE: An Intent-Centered Evaluation Metric for Natural Language to SQL

Abstract

Execution Accuracy (EX), the widely used metric for evaluating Natural Language to SQL (NL2SQL) solutions, is becoming increasingly unreliable. It is sensitive to syntactic variation, ignores the fact that a question may admit multiple interpretations, and is easily misled by erroneous ground-truth SQL. To address this, we introduce ROSE, an intent-centered metric that asks whether the predicted SQL actually answers the user's question, rather than whether it is consistent with the ground-truth SQL under the reference-dependent paradigm. ROSE employs an adversarial Prover-Refuter cascade: the SQL Prover independently assesses the semantic correctness of a predicted SQL against the user's intent, while the Adversarial Refuter uses the ground-truth SQL as evidence to challenge and refine this judgment. On our expert-aligned validation set ROSE-VEC, ROSE achieves the best agreement with human experts, outperforming the next-best metric by nearly 24% in Cohen's Kappa. We also conduct a large-scale re-evaluation of 19 NL2SQL methods, revealing four valuable insights. We release ROSE and ROSE-VEC to facilitate more reliable NL2SQL research.
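
For readers unfamiliar with the agreement statistic mentioned above, the following is a minimal sketch of how Cohen's Kappa can be computed between human verdicts and a metric's verdicts. The function name `cohen_kappa` and the label lists are illustrative assumptions only; they are not taken from ROSE or ROSE-VEC and do not reproduce any reported result.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's label
    frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Fraction of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    all_labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in all_labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts (1 = predicted SQL judged correct, 0 = incorrect).
human = [1, 1, 0, 1, 0, 0, 1, 1]
metric = [1, 1, 0, 0, 0, 1, 1, 1]
print(round(cohen_kappa(human, metric), 3))  # prints 0.467
```

In this toy example the raters agree on 6 of 8 items (0.75), but chance alone would yield about 0.53 agreement, so Kappa is noticeably lower than the raw agreement rate. This is why Kappa is a stricter measure of alignment with human experts than simple accuracy.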