BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions

Abstract

Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short by treating conversation histories as static context or limiting evaluation to read-only operations, failing to reflect production-grade database assistant challenges. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment coupling each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings consisting of a pre-defined conversational protocol (c-Interact) and an open-ended agentic setting (a-Interact) where models autonomously decide when to query the user simulator or explore the environment; (3) a challenging task suite covering the full CRUD spectrum for business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks requiring dynamic interaction. The suite comprises BIRD-INTERACT-FULL (600 tasks, up to 11,796 interactions) for comprehensive performance assessment, and BIRD-INTERACT-LITE (300 tasks with simplified databases) for detailed behavioral analysis and rapid method development. Our empirical results highlight BIRD-INTERACT's difficulty: GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact. Analysis via memory grafting and Interaction Test-time Scaling validates the importance of effective interaction for complex, dynamic text-to-SQL tasks.
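
The abstract does not specify what an executable test case looks like. As a minimal sketch, assuming a test case consists of a setup script, a state-checking query, and the expected rows (the `run_test_case` helper, the `orders` table, and all SQL below are hypothetical illustrations, not part of BIRD-INTERACT), the following shows why write operations may need state-based checks rather than a comparison of returned result sets:

```python
import sqlite3


def run_test_case(setup_sql, predicted_sql, check_sql, expected_rows):
    """Apply predicted_sql to a fresh in-memory database, then verify the
    resulting database state with check_sql.

    Hypothetical test-case shape; the benchmark's real format may differ.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        try:
            conn.executescript(predicted_sql)
        except sqlite3.Error:
            # In this sketch, an execution error counts as a failed task.
            return False
        return conn.execute(check_sql).fetchall() == expected_rows
    finally:
        conn.close()


# Hypothetical schema and data, for illustration only.
SETUP = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
INSERT INTO orders (id, status) VALUES (1, 'pending'), (2, 'shipped');
"""
CHECK = "SELECT id, status FROM orders ORDER BY id"
EXPECTED = [(1, "cancelled"), (2, "shipped")]

# Correct update: only order 1 is cancelled.
good = run_test_case(
    SETUP, "UPDATE orders SET status = 'cancelled' WHERE id = 1;", CHECK, EXPECTED
)
# Over-broad update: runs without error but leaves the wrong state.
bad = run_test_case(SETUP, "UPDATE orders SET status = 'cancelled';", CHECK, EXPECTED)
# Invalid SQL: fails at execution time.
broken = run_test_case(
    SETUP, "UPDATE order SET status = 'cancelled' WHERE id = 1;", CHECK, EXPECTED
)
```

An `UPDATE` returns no rows, so the over-broad statement could not be told apart from the correct one by its output alone; only inspecting the database afterwards distinguishes them.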
