SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications

Abstract

Resolving complex SQL issues remains a significant bottleneck in real-world database applications. Current Large Language Models (LLMs), while adept at text-to-SQL translation, have not been rigorously evaluated on the more challenging task of debugging SQL issues. To address this gap, we introduce BIRD-CRITIC, a new SQL issue debugging benchmark comprising 530 PostgreSQL tasks (BIRD-CRITIC-PG) and 570 multi-dialect tasks (BIRD-CRITIC-Multi), distilled from authentic user issues and replayed within new environments to facilitate rigorous evaluation. Baseline evaluations underscore the task's complexity: the leading reasoning model O3-Mini achieves a success rate of only 38.87% on BIRD-CRITIC-PG and 33.33% on BIRD-CRITIC-Multi. Meanwhile, advancing open-source models for database tasks is crucial for empowering local development while safeguarding data privacy. Therefore, we present Six-Gym (Sql-fIX-Gym), a training environment for improving the SQL issue debugging capabilities of open-source models. This environment leverages the SQL-Rewind strategy, which automatically generates executable issue-solution datasets by reverse-engineering issues from verified SQL. However, popular trajectory-based fine-tuning methods leave substantial supervisory signals unexplored. We further propose f-Plan Boosting, which extracts high-level debugging plans from SQL solutions, enabling teacher LLMs to produce 73.7% more successful trajectories for training. We integrate these components into an open-source agent, Bird-Fixer. Based on Qwen-2.5-Coder-14B, Bird-Fixer achieves a success rate of 38.11% on BIRD-CRITIC-PG and 29.65% on BIRD-CRITIC-Multi, surpassing leading proprietary models such as Claude-3.7-Sonnet and GPT-4.1 and marking a significant step toward democratizing sophisticated SQL-debugging capabilities. The leaderboard and source code are available at https://bird-critic.github.io/
