TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas

Abstract


Text-to-SQL parsing has achieved remarkable progress under the Full Schema Assumption. However, this premise fails in real-world enterprise environments, where databases contain hundreds of tables with massive amounts of noisy metadata. Rather than being given the full schema upfront, an agent must actively identify and verify only the relevant subset, giving rise to the Unknown Schema scenario we study in this work. To address this, we propose TRUST-SQL (Truthful Reasoning with Unknown Schema via Tools). We formulate the task as a Partially Observable Markov Decision Process in which an autonomous agent follows a structured four-phase protocol to ground its reasoning in verified metadata. Crucially, this protocol provides a structural boundary for our novel Dual-Track GRPO strategy. By applying token-level masked advantages, this strategy isolates exploration rewards from execution outcomes to resolve credit assignment, yielding a 9.9% relative improvement over standard GRPO. Extensive experiments across five benchmarks demonstrate that TRUST-SQL achieves average absolute improvements of 30.6% and 16.6% for the 4B and 8B variants, respectively, over their base models. Remarkably, despite operating entirely without pre-loaded metadata, our framework consistently matches or surpasses strong baselines that rely on schema prefilling.
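
The abstract does not spell out how the token-level masked advantages are computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each rollout receives two scalar rewards (one for schema exploration, one for SQL execution), that each reward is normalized within its rollout group in the usual GRPO fashion, and that a per-token mask marks which tokens belong to the exploration segment. The function names, the reward values, and the mask layout are all hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dual_track_token_advantages(explore_rewards, exec_rewards, explore_masks):
    """Assign a per-token advantage to every rollout in a group.

    Assumption (not from the source): explore_masks[i][t] is 1 when token t of
    rollout i belongs to the exploration segment and 0 when it belongs to the
    SQL-generation segment. Exploration tokens receive only the exploration
    advantage; the remaining tokens receive only the execution advantage.
    """
    a_explore = group_advantages(explore_rewards)
    a_exec = group_advantages(exec_rewards)
    per_token = []
    for a_x, a_e, mask in zip(a_explore, a_exec, explore_masks):
        per_token.append([a_x if m else a_e for m in mask])
    return per_token


# Hypothetical group of two rollouts: the first explored the schema well but
# produced a failing query, the second explored poorly but executed correctly.
advantages = dual_track_token_advantages(
    explore_rewards=[1.0, 0.0],
    exec_rewards=[0.0, 1.0],
    explore_masks=[[1, 1, 0], [1, 0, 0]],
)
```

Under these assumptions, a rollout that explores well but writes a failing query still gets a positive signal on its exploration tokens and a negative one on its SQL tokens, which is the kind of separation the credit-assignment claim describes.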