BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions

October 6, 2025
Authors: Nan Huo, Xiaohan Xu, Jinyang Li, Per Jacobsson, Shipei Lin, Bowen Qin, Binyuan Hui, Xiaolong Li, Ge Qu, Shuzheng Si, Linheng Han, Edward Alexander, Xintong Zhu, Rui Qin, Ruihan Yu, Yiyao Jin, Feige Zhou, Weihao Zhong, Yun Chen, Hongyu Liu, Chenhao Ma, Fatma Ozcan, Yannis Papakonstantinou, Reynold Cheng
cs.AI

Abstract

Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short by treating conversation histories as static context or limiting evaluation to read-only operations, failing to reflect production-grade database assistant challenges. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment coupling each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings consisting of a pre-defined conversational protocol (c-Interact) and an open-ended agentic setting (a-Interact) where models autonomously decide when to query the user simulator or explore the environment; (3) a challenging task suite covering the full CRUD spectrum for business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks requiring dynamic interaction. The suite comprises BIRD-INTERACT-FULL (600 tasks, up to 11,796 interactions) for comprehensive performance assessment, and BIRD-INTERACT-LITE (300 tasks with simplified databases) for detailed behavioral analysis and rapid method development. Our empirical results highlight BIRD-INTERACT's difficulty: GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact. Analysis via memory grafting and Interaction Test-time Scaling validates the importance of effective interaction for complex, dynamic text-to-SQL tasks.