Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis

Abstract

Recent large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, are difficult to interpret because of their dense anatomical structures and subtle pathological cues, and existing medical benchmarks and instruction datasets do not capture these challenges. To address this gap, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, GPT-4o, achieves only 41.45% accuracy, revealing significant limitations of current models in this domain. To promote progress in this domain, we also propose OralGPT, obtained by supervised fine-tuning (SFT) of Qwen2.5-VL-7B on our curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance gains for LVLMs; for example, OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a foundation for intelligent dentistry and for enabling more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.