SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications

Abstract

Resolving complex SQL issues remains a significant bottleneck in real-world database applications. Current Large Language Models (LLMs), while adept at text-to-SQL translation, have not been rigorously evaluated on the more challenging task of debugging SQL issues. To address this gap, we introduce BIRD-CRITIC, a new SQL issue debugging benchmark comprising 530 PostgreSQL tasks (BIRD-CRITIC-PG) and 570 multi-dialect tasks (BIRD-CRITIC-Multi), distilled from authentic user issues and replayed within new environments to facilitate rigorous evaluation. Baseline evaluations underscore the task's complexity: the leading reasoning model, O3-Mini, achieves a success rate of only 38.87% on BIRD-CRITIC-PG and 33.33% on BIRD-CRITIC-Multi. Meanwhile, advancing open-source models for database tasks is crucial for empowering local development while safeguarding data privacy. We therefore present Six-Gym (Sql-fIX-Gym), a training environment for elevating open-source model capabilities in SQL issue debugging. This environment leverages the SQL-Rewind strategy, which automatically generates executable issue-solution datasets by reverse-engineering issues from verified SQL queries. However, popular trajectory-based fine-tuning methods leave substantial supervisory signals unexplored. We further propose f-Plan Boosting, which extracts high-level debugging plans from SQL solutions, enabling teacher LLMs to produce 73.7% more successful trajectories for training. We integrate these components into an open-source agent, Bird-Fixer. Based on Qwen-2.5-Coder-14B, Bird-Fixer achieves a success rate of 38.11% on BIRD-CRITIC-PG and 29.65% on BIRD-CRITIC-Multi, surpassing leading proprietary models such as Claude-3.7-Sonnet and GPT-4.1 and marking a significant step toward democratizing sophisticated SQL-debugging capabilities. The leaderboard and source code are available at https://bird-critic.github.io/
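
The abstract describes SQL-Rewind only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes an in-memory SQLite database in place of PostgreSQL, and the names `run`, `rewind`, `success_rate`, the `orders` table, and the SUM-to-COUNT mutation are all hypothetical choices made for illustration. The sketch starts from a verified query, derives a candidate issue query by mutating it, keeps the pair only if execution shows the two behave differently, and then scores a fixer by comparing executed results.

```python
import sqlite3


def run(conn, sql):
    """Execute sql and return its rows sorted, or an error string on failure."""
    try:
        return sorted(conn.execute(sql).fetchall())
    except sqlite3.Error as exc:
        return f"error: {exc}"


def rewind(conn, verified_sql, mutate):
    """Derive an (issue_sql, solution_sql) pair from a verified query.

    Returns None when the mutated query behaves the same as the verified
    one, because such a mutation does not represent an observable issue.
    """
    issue_sql = mutate(verified_sql)
    if run(conn, issue_sql) == run(conn, verified_sql):
        return None
    return issue_sql, verified_sql


def success_rate(conn, pairs, fixer):
    """Fraction of issues whose proposed fix reproduces the verified result."""
    solved = sum(
        run(conn, fixer(issue)) == run(conn, solution)
        for issue, solution in pairs
    )
    return solved / len(pairs)


# Hypothetical toy schema and data, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount)
    VALUES ('ann', 10.0), ('ann', 5.0), ('bob', 7.0);
""")

verified = "SELECT customer, SUM(amount) FROM orders GROUP BY customer"

# Hypothetical mutation: swap the aggregate so the query returns wrong values.
pair = rewind(conn, verified, lambda sql: sql.replace("SUM(amount)", "COUNT(amount)"))

# A fixer that changes nothing solves no issues; one that undoes the mutation solves all.
no_op_rate = success_rate(conn, [pair], lambda sql: sql)
oracle_rate = success_rate(
    conn, [pair], lambda sql: sql.replace("COUNT(amount)", "SUM(amount)")
)
print(no_op_rate, oracle_rate)
```

Comparing executed results rather than query text is the design choice that matters here: a fix counts as successful when it behaves like the verified query, even if it is written differently.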
