BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions

Abstract

Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short: they treat conversation histories as static context or limit evaluation to read-only operations, and so fail to reflect the challenges faced by production-grade database assistants. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment coupling each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings, a pre-defined conversational protocol (c-Interact) and an open-ended agentic setting (a-Interact) in which models autonomously decide when to query the user simulator or explore the environment; and (3) a challenging task suite covering the full CRUD spectrum for business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks requiring dynamic interaction. The suite comprises BIRD-INTERACT-FULL (600 tasks, up to 11,796 interactions) for comprehensive performance assessment, and BIRD-INTERACT-LITE (300 tasks with simplified databases) for detailed behavioral analysis and rapid method development. Our empirical results highlight BIRD-INTERACT's difficulty: GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact. Analysis via memory grafting and Interaction Test-time Scaling validates the importance of effective interaction for complex, dynamic text-to-SQL tasks.
