SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

Abstract

Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It comprises SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities, with over 450,000 instruction instances, and SpineBench, a clinically grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and approximately 1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question answering, multi-turn consultations, and report generation. SpineBench evaluates models along clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recent advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.