SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications

Abstract

Resolving complex SQL issues remains a significant bottleneck in real-world database applications. Current Large Language Models (LLMs), while adept at text-to-SQL translation, have not been rigorously evaluated on the more challenging task of debugging SQL issues. To address this gap, we introduce BIRD-CRITIC, a new SQL issue debugging benchmark comprising 530 PostgreSQL tasks (BIRD-CRITIC-PG) and 570 multi-dialect tasks (BIRD-CRITIC-Multi), distilled from authentic user issues and replayed within new environments to facilitate rigorous evaluation. Baseline evaluations underscore the task's complexity: the leading reasoning model, O3-Mini, achieves a success rate of only 38.87% on BIRD-CRITIC-PG and 33.33% on BIRD-CRITIC-Multi. Meanwhile, advancing open-source models for database tasks is crucial for empowering local development while safeguarding data privacy. We therefore present Six-Gym (Sql-fIX-Gym), a training environment for improving the SQL issue debugging capabilities of open-source models. This environment leverages the SQL-Rewind strategy, which automatically generates executable issue-solution datasets by reverse-engineering issues from verified SQL queries. However, popular trajectory-based fine-tuning methods leave substantial supervisory signals unexplored. We further propose f-Plan Boosting, which extracts high-level debugging plans from SQL solutions, enabling teacher LLMs to produce 73.7% more successful trajectories for training. We integrate these components into an open-source agent, Bird-Fixer. Based on Qwen-2.5-Coder-14B, Bird-Fixer achieves a success rate of 38.11% on BIRD-CRITIC-PG and 29.65% on BIRD-CRITIC-Multi, surpassing leading proprietary models such as Claude-3.7-Sonnet and GPT-4.1 and marking a significant step toward democratizing sophisticated SQL debugging capabilities. The leaderboard and source code are available at https://bird-critic.github.io/
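The abstract does not describe how SQL-Rewind is implemented, so the following is only a rough illustration of the general idea it names: start from a verified query, derive a faulty variant, and keep the pair only when execution shows the two actually behave differently. Everything here is an assumption made for illustration: the use of SQLite from the Python standard library rather than PostgreSQL, the hypothetical helpers `rewind` and `is_fixed`, the toy `orders` table, and the simple string-replacement mutation. It should not be read as the authors' method.

```python
import sqlite3


def run(conn, sql):
    """Execute a query and return its rows sorted, so results compare order-insensitively."""
    return sorted(conn.execute(sql).fetchall())


def rewind(conn, verified_sql, old, new):
    """Hypothetical helper: derive an 'issue' query from a verified one.

    Applies one textual mutation and returns (issue_sql, solution_sql) only if
    the mutated query fails to execute or yields a different result.
    Returns None when the mutation does not apply or changes nothing observable.
    """
    if old not in verified_sql:
        return None
    issue_sql = verified_sql.replace(old, new, 1)
    try:
        broken_rows = run(conn, issue_sql)
    except sqlite3.Error:
        # A query that no longer executes is also an issue worth debugging.
        return issue_sql, verified_sql
    if broken_rows == run(conn, verified_sql):
        return None
    return issue_sql, verified_sql


def is_fixed(conn, candidate_sql, solution_sql):
    """Hypothetical check: a candidate fix passes if it reproduces the verified result."""
    try:
        return run(conn, candidate_sql) == run(conn, solution_sql)
    except sqlite3.Error:
        return False


# Toy database, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'ann', 10.0), (2, 'ann', 30.0), (3, 'bob', 5.0);
    """
)

verified = "SELECT customer, SUM(amount) FROM orders GROUP BY customer"

# Wrong aggregate: executes, but returns a different total for 'ann'.
logic_pair = rewind(conn, verified, "SUM(amount)", "MAX(amount)")

# Misspelled column: the mutated query raises an error when executed.
error_pair = rewind(conn, verified, "GROUP BY customer", "GROUP BY custmer")
```

In this sketch a pair is accepted purely on execution behavior, which is what makes the resulting issue-solution data checkable by running it; a candidate repair can then be scored with `is_fixed(conn, candidate, logic_pair[1])`.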
