BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions
October 6, 2025
Authors: Nan Huo, Xiaohan Xu, Jinyang Li, Per Jacobsson, Shipei Lin, Bowen Qin, Binyuan Hui, Xiaolong Li, Ge Qu, Shuzheng Si, Linheng Han, Edward Alexander, Xintong Zhu, Rui Qin, Ruihan Yu, Yiyao Jin, Feige Zhou, Weihao Zhong, Yun Chen, Hongyu Liu, Chenhao Ma, Fatma Ozcan, Yannis Papakonstantinou, Reynold Cheng
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable performance on
single-turn text-to-SQL tasks, but real-world database applications
predominantly require multi-turn interactions to handle ambiguous queries,
execution errors, and evolving user requirements. Existing multi-turn
benchmarks fall short by treating conversation histories as static context or
limiting evaluation to read-only operations, failing to reflect
production-grade database assistant challenges. We introduce BIRD-INTERACT, a
benchmark that restores this realism through: (1) a comprehensive interaction
environment coupling each database with a hierarchical knowledge base, metadata
files, and a function-driven user simulator, enabling models to solicit
clarifications, retrieve knowledge, and recover from errors without human
supervision; (2) two evaluation settings consisting of a pre-defined
conversational protocol (c-Interact) and an open-ended agentic setting
(a-Interact) where models autonomously decide when to query the user simulator
or explore the environment; (3) a challenging task suite covering the full CRUD
spectrum for business-intelligence and operational use cases, guarded by
executable test cases. Each task features ambiguous and follow-up sub-tasks
requiring dynamic interaction. The suite comprises BIRD-INTERACT-FULL (600
tasks, up to 11,796 interactions) for comprehensive performance assessment, and
BIRD-INTERACT-LITE (300 tasks with simplified databases) for detailed
behavioral analysis and rapid method development. Our empirical results
highlight BIRD-INTERACT's difficulty: GPT-5 completes only 8.67% of tasks in
c-Interact and 17.00% in a-Interact. Analysis via memory grafting and
Interaction Test-time Scaling validates the importance of effective interaction
for complex, dynamic text-to-SQL tasks.
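
The abstract notes that tasks span the full CRUD spectrum and are guarded by executable test cases. As a rough illustration of why write operations call for executable checks rather than result-set comparison alone, the sketch below runs a predicted statement against a fresh database and then verifies the resulting state. It is not the benchmark's actual harness: the helper `passes_test_case`, the `orders` table, and the use of Python's built-in `sqlite3` module as a stand-in database engine are all assumptions made for this example.

```python
import sqlite3


def passes_test_case(setup_sql, predicted_sql, check_sql, expected_rows):
    """Hypothetical checker: run predicted SQL on a fresh in-memory
    database, then compare the rows of a check query with expected rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)  # build the initial database state
        try:
            conn.executescript(predicted_sql)  # apply the model's SQL
        except sqlite3.Error:
            return False  # an execution error counts as a failed test
        rows = conn.execute(check_sql).fetchall()
        return rows == expected_rows
    finally:
        conn.close()


# Illustrative schema and data (assumed, not taken from the benchmark).
SETUP = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
INSERT INTO orders (id, status) VALUES (1, 'pending'), (2, 'shipped');
"""
CHECK = "SELECT id, status FROM orders ORDER BY id"
EXPECTED = [(1, "cancelled"), (2, "shipped")]

# Correct update: only order 1 is cancelled.
good = passes_test_case(
    SETUP, "UPDATE orders SET status = 'cancelled' WHERE id = 1;", CHECK, EXPECTED
)
# Over-broad update: missing WHERE clause also changes order 2.
bad = passes_test_case(
    SETUP, "UPDATE orders SET status = 'cancelled';", CHECK, EXPECTED
)
# Statement that fails at execution time.
broken = passes_test_case(
    SETUP, "UPDATE no_such_table SET status = 'x';", CHECK, EXPECTED
)

print(good, bad, broken)
```

In this sketch each candidate statement runs against its own fresh database, so a destructive or over-broad write in one attempt cannot leak into the next check.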