SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus
October 3, 2025
Authors: Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, Zhicheng Zhang, Dachuan Li, Xiaosheng Ma, Yuli Ma, Jianing Ni, Changjiang Jiang, Lixia Tian, Qixin Chen, Kaishun Xia, Pingping Liu, Tongshun Zhang, Zhiqiang Liu, Zhongan Bi, Chenyang Si, Tiansheng Sun, Caifeng Shan
cs.AI
Abstract
Spine disorders affect 619 million people globally and are a leading cause of
disability, yet AI-assisted diagnosis remains limited by the lack of
level-aware, multimodal datasets. Clinical decision-making for spine disorders
requires sophisticated reasoning across X-ray, CT, and MRI at specific
vertebral levels. However, progress has been constrained by the absence of
traceable, clinically grounded instruction data and standardized,
spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem
co-designed with practicing spine surgeons. It features SpineMed-450k, the
first large-scale dataset explicitly designed for vertebral-level reasoning
across imaging modalities with over 450,000 instruction instances, and
SpineBench, a clinically grounded evaluation framework. SpineMed-450k is
curated from diverse sources, including textbooks, guidelines, open datasets,
and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline
with a two-stage LLM generation method (draft and revision) to ensure
high-quality, traceable data for question-answering, multi-turn consultations,
and report generation. SpineBench evaluates models on clinically salient axes,
including level identification, pathology assessment, and surgical planning.
Our comprehensive evaluation of several recent, advanced large vision-language
models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained,
level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k
demonstrates consistent and significant improvements across all tasks.
Clinician assessments confirm the diagnostic clarity and practical utility of
our model's outputs.