BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions

Abstract

Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks either treat conversation histories as static context or limit evaluation to read-only operations, so they fail to reflect the challenges faced by production-grade database assistants. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment that couples each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings, a pre-defined conversational protocol (c-Interact) and an open-ended agentic setting (a-Interact) in which models autonomously decide when to query the user simulator or explore the environment; (3) a challenging task suite covering the full CRUD spectrum for business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks that require dynamic interaction. The suite comprises BIRD-INTERACT-FULL (600 tasks, up to 11,796 interactions) for comprehensive performance assessment, and BIRD-INTERACT-LITE (300 tasks with simplified databases) for detailed behavioral analysis and rapid method development. Our empirical results highlight BIRD-INTERACT's difficulty: GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact. Analysis via memory grafting and Interaction Test-time Scaling validates the importance of effective interaction for complex, dynamic text-to-SQL tasks.
