SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

October 3, 2025
Authors: Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, Zhicheng Zhang, Dachuan Li, Xiaosheng Ma, Yuli Ma, Jianing Ni, Changjiang Jiang, Lixia Tian, Qixin Chen, Kaishun Xia, Pingping Liu, Tongshun Zhang, Zhiqiang Liu, Zhongan Bi, Chenyang Si, Tiansheng Sun, Caifeng Shan
cs.AI

Abstract

Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities, with over 450,000 instruction instances, and SpineBench, a clinically grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recent large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.