R^3-SQL: Ranking Reward and Resampling for Text-to-SQL

Abstract

Modern Text-to-SQL systems generate multiple candidate SQL queries and rank them to select a final prediction. However, existing methods face two limitations. First, they often assign inconsistent scores to functionally equivalent SQL queries, even when those queries produce identical execution results. Second, ranking cannot recover when the correct SQL is absent from the candidate pool. We propose R^3-SQL, a Text-to-SQL framework that addresses both issues through a unified reward for ranking and resampling. R^3-SQL first groups candidates by execution result and ranks the groups rather than individual queries, so that equivalent queries are scored consistently. To score each group, it combines a pairwise preference across groups with a pointwise utility derived from the best group rank and size, capturing relative preference, consistency, and candidate quality. To improve candidate recall, R^3-SQL introduces agentic resampling, which assesses the generated candidate pool and selectively resamples when the correct SQL is likely absent. R^3-SQL achieves 75.03 execution accuracy on BIRD-dev, a new state of the art among methods using models with disclosed sizes, and shows consistent gains across five benchmarks.
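The grouping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`execution_key`, `group_by_execution`), the toy `emp` table, the use of SQLite, and the choice to compare results as order-insensitive row collections are all assumptions made for illustration. The sketch covers only grouping by execution result; the pairwise and pointwise reward and the agentic resampling step are not reproduced here.

```python
import sqlite3
from collections import defaultdict


def execution_key(conn, sql):
    """Return a hashable signature of a query's result, or None if it fails.

    Assumption: two results are treated as equal when they contain the same
    rows regardless of order. The abstract does not specify this rule.
    """
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort by repr so rows with mixed types or NULLs remain comparable.
    return tuple(sorted(rows, key=repr))


def group_by_execution(conn, candidates):
    """Group candidate SQL strings by identical execution result."""
    groups = defaultdict(list)
    for sql in candidates:
        key = execution_key(conn, sql)
        if key is not None:  # drop candidates that fail to execute
            groups[key].append(sql)
    return dict(groups)


# Hypothetical toy database and candidate pool.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp(id INTEGER, dept TEXT, salary INTEGER);
INSERT INTO emp VALUES (1, 'a', 10), (2, 'a', 20), (3, 'b', 30);
""")
candidates = [
    "SELECT COUNT(*) FROM emp WHERE dept = 'a'",
    "SELECT COUNT(id) FROM emp WHERE dept = 'a'",  # same result as above
    "SELECT COUNT(*) FROM emp",
    "SELECT COUNT(*) FROM missing_table",          # fails to execute
]
groups = group_by_execution(conn, candidates)
for key, sqls in groups.items():
    print(len(sqls), key)
```

In this sketch the first two candidates fall into the same group because they return the same rows, so any score assigned at the group level is shared by both. This is the consistency property that motivates ranking groups rather than individual queries.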