SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications
June 23, 2025
Authors: Jinyang Li, Xiaolong Li, Ge Qu, Per Jacobsson, Bowen Qin, Binyuan Hui, Shuzheng Si, Nan Huo, Xiaohan Xu, Yue Zhang, Ziwei Tang, Yuanshuai Li, Florensia Widjaja, Xintong Zhu, Feige Zhou, Yongfeng Huang, Yannis Papakonstantinou, Fatma Ozcan, Chenhao Ma, Reynold Cheng
cs.AI
Abstract
Resolving complex SQL issues remains a significant bottleneck in
real-world database applications. Current Large Language Models (LLMs), while
adept at text-to-SQL translation, have not been rigorously evaluated on the
more challenging task of debugging SQL issues. To address this gap, we
introduce BIRD-CRITIC, a new SQL issue debugging benchmark comprising 530
PostgreSQL tasks (BIRD-CRITIC-PG) and 570 multi-dialect tasks
(BIRD-CRITIC-Multi), distilled from authentic user issues and replayed within
new environments to facilitate rigorous evaluation. Baseline evaluations
underscore the task's complexity: the leading reasoning model, O3-Mini,
achieves a success rate of only 38.87% on BIRD-CRITIC-PG and 33.33% on
BIRD-CRITIC-Multi. Meanwhile, advancing open-source models for database tasks
is crucial for empowering local development while safeguarding data privacy.
Therefore, we present Six-Gym (Sql-fIX-Gym), a training environment for
improving the SQL issue debugging capabilities of open-source models. This
environment uses the SQL-Rewind strategy, which automatically generates
executable issue-solution datasets by reverse-engineering issues from verified
SQL queries. However, popular trajectory-based fine-tuning methods do not fully
exploit the available supervisory signals. We therefore propose f-Plan Boosting, which
extracts high-level debugging plans from SQL solutions, enabling teacher LLMs
to produce 73.7% more successful trajectories for training. We integrate these
components into an open-source agent, Bird-Fixer. Based on Qwen-2.5-Coder-14B,
Bird-Fixer achieves a success rate of 38.11% on BIRD-CRITIC-PG and 29.65% on
BIRD-CRITIC-Multi, surpassing leading proprietary models such as
Claude-3.7-Sonnet and GPT-4.1 and marking a significant step toward democratizing
sophisticated SQL-debugging capabilities. The leaderboard and source code are
available at https://bird-critic.github.io/
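
The abstract does not spell out how SQL-Rewind works beyond reverse-engineering issues from verified SQL, so the following is only a minimal sketch of that general idea, not the authors' implementation. It assumes an in-memory SQLite database in place of PostgreSQL, and the names `run`, `rewind`, and `drop_group_by` are hypothetical helpers introduced for illustration. A verified query is mutated into a faulty one, and the pair is kept only if execution confirms that the faulty query actually misbehaves, which is what makes the resulting issue-solution pair executable.

```python
import sqlite3


def run(conn, sql):
    """Execute a query; return its sorted rows, or the error it raised."""
    try:
        return sorted(conn.execute(sql).fetchall())
    except sqlite3.Error as exc:
        return exc


def rewind(conn, verified_sql, mutate):
    """Derive an (issue, solution) pair from a verified query.

    `mutate` is a hypothetical bug-injection function. The pair is
    discarded unless the mutated query errors or returns different rows.
    """
    expected = run(conn, verified_sql)
    if isinstance(expected, Exception):
        return None  # the "verified" query must itself execute
    issue_sql = mutate(verified_sql)
    observed = run(conn, issue_sql)
    if not isinstance(observed, Exception) and observed == expected:
        return None  # the mutation did not introduce a real issue
    return {
        "issue_sql": issue_sql,
        "solution_sql": verified_sql,
        "expected": expected,
    }


# Toy schema and data, assumed purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'ann', 10.0), (2, 'ann', 5.0), (3, 'bob', 7.0);
    """
)

verified = "SELECT customer, SUM(amount) FROM orders GROUP BY customer"


def drop_group_by(sql):
    """Hypothetical mutation: remove the GROUP BY clause."""
    return sql.replace(" GROUP BY customer", "")


pair = rewind(conn, verified, drop_group_by)
```

Here the mutated query collapses all orders into a single row, so its result differs from the verified one and the pair is retained. A mutation that leaves the result unchanged would be rejected.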