SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

Abstract

Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It comprises SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities, with over 450,000 instruction instances, and SpineBench, a clinically grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and approximately 1,000 de-identified hospital cases. It is built with a clinician-in-the-loop pipeline and a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question answering, multi-turn consultations, and report generation. SpineBench evaluates models along clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recent large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.

