BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions

Abstract

Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short by treating conversation histories as static context or limiting evaluation to read-only operations, failing to reflect production-grade database assistant challenges. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment coupling each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings consisting of a pre-defined conversational protocol (c-Interact) and an open-ended agentic setting (a-Interact) where models autonomously decide when to query the user simulator or explore the environment; (3) a challenging task suite covering the full CRUD spectrum for business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks requiring dynamic interaction. The suite comprises BIRD-INTERACT-FULL (600 tasks, up to 11,796 interactions) for comprehensive performance assessment, and BIRD-INTERACT-LITE (300 tasks with simplified databases) for detailed behavioral analysis and rapid method development. Our empirical results highlight BIRD-INTERACT's difficulty: GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact. Analysis via memory grafting and Interaction Test-time Scaling validates the importance of effective interaction for complex, dynamic text-to-SQL tasks.
