TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas

Abstract

Text-to-SQL parsing has achieved remarkable progress under the Full Schema Assumption. However, this premise fails in real-world enterprise environments, where databases contain hundreds of tables with large amounts of noisy metadata. Rather than being given the full schema upfront, an agent must actively identify and verify only the relevant subset, giving rise to the Unknown Schema scenario we study in this work. To address this, we propose TRUST-SQL (Truthful Reasoning with Unknown Schema via Tools). We formulate the task as a Partially Observable Markov Decision Process in which our autonomous agent employs a structured four-phase protocol to ground reasoning in verified metadata. Crucially, this protocol provides a structural boundary for our novel Dual-Track GRPO strategy. By applying token-level masked advantages, this strategy isolates exploration rewards from execution outcomes to resolve credit assignment, yielding a 9.9% relative improvement over standard GRPO. Extensive experiments across five benchmarks demonstrate that the 4B and 8B variants of TRUST-SQL achieve average absolute improvements of 30.6% and 16.6%, respectively, over their base models. Remarkably, despite operating entirely without pre-loaded metadata, our framework consistently matches or surpasses strong baselines that rely on schema prefilling.
