BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases
May 23, 2025
Authors: Mathew J. Koretsky, Maya Willey, Adi Asija, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak, Nicole Kuznetsov, Sungwon Kim, Mike A. Nalls, Daniel Khashabi, Faraz Faghri
cs.AI
Abstract
Biomedical researchers increasingly rely on large-scale structured databases
for complex analytical tasks. However, current text-to-SQL systems often
struggle to map qualitative scientific questions into executable SQL,
particularly when implicit domain reasoning is required. We introduce
BiomedSQL, the first benchmark explicitly designed to evaluate scientific
reasoning in text-to-SQL generation over a real-world biomedical knowledge
base. BiomedSQL comprises 68,000 question/SQL query/answer triples grounded in
a harmonized BigQuery knowledge base that integrates gene-disease associations,
causal inference from omics data, and drug approval records. Each question
requires models to infer domain-specific criteria, such as genome-wide
significance thresholds, effect directionality, or trial phase filtering,
rather than rely on syntactic translation alone. We evaluate a range of open-
and closed-source LLMs across prompting strategies and interaction paradigms.
Our results reveal a substantial performance gap: GPT-o3-mini achieves 59.0%
execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%,
both well below the expert baseline of 90.0%. BiomedSQL provides a new
foundation for advancing text-to-SQL systems capable of supporting scientific
discovery through robust reasoning over structured biomedical knowledge bases.
Our dataset is publicly available at
https://huggingface.co/datasets/NIH-CARD/BiomedSQL, and our code is open-source
at https://github.com/NIH-CARD/biomedsql.
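
To make the idea of implicit domain reasoning concrete, the following is a minimal sketch, not the benchmark's actual schema, data, or scoring code. It assumes a hypothetical table `gwas` with invented columns and rows, held in an in-memory SQLite database rather than BigQuery. It contrasts a purely syntactic translation of a question with one that applies the conventional genome-wide significance threshold (p < 5e-8) and an effect-direction filter, then compares the two result sets in a simplified execution-match check.

```python
import sqlite3

# Hypothetical toy schema and rows; not the BiomedSQL BigQuery schema or data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gwas (gene TEXT, disease TEXT, p_value REAL, beta REAL)")
conn.executemany(
    "INSERT INTO gwas VALUES (?, ?, ?, ?)",
    [
        ("GENE_A", "disease_x", 3e-12, 0.21),   # significant, risk-increasing
        ("GENE_B", "disease_x", 4e-4, -0.05),   # not genome-wide significant
        ("GENE_C", "disease_x", 1e-9, -0.30),   # significant, risk-decreasing
    ],
)

GENOME_WIDE_P = 5e-8  # conventional genome-wide significance threshold

# Question: "Which genes are significantly associated with increased risk
# of disease_x?"

# Syntactic translation: ignores what "significantly" and "increased" imply.
naive_sql = "SELECT gene FROM gwas WHERE disease = 'disease_x' ORDER BY gene"

# Domain-aware translation: applies the threshold and the effect direction.
domain_sql = (
    "SELECT gene FROM gwas "
    "WHERE disease = 'disease_x' AND p_value < ? AND beta > 0 ORDER BY gene"
)

naive_rows = conn.execute(naive_sql).fetchall()
domain_rows = conn.execute(domain_sql, (GENOME_WIDE_P,)).fetchall()


def execution_match(predicted, gold):
    """Return True when two result sets contain the same rows, ignoring order."""
    return sorted(predicted) == sorted(gold)


print(naive_rows)
print(domain_rows)
print(execution_match(naive_rows, domain_rows))
```

In this toy setup the naive query returns all three genes while the domain-aware query returns only `GENE_A`, so the simplified match check fails. The comparison function is an illustrative stand-in and should not be read as the paper's definition of execution accuracy.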