SpineBench: A Clinically Salient, Level-Aware Benchmark Built on the SpineMed-450k Corpus

Abstract

Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels, but progress has been constrained by the absence of traceable, clinically grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It comprises SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities, with over 450,000 instruction instances, and SpineBench, a clinically grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and approximately 1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question answering, multi-turn consultations, and report generation. SpineBench evaluates models along clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recent large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.