
Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis

September 11, 2025
作者: Jing Hao, Yuxuan Fan, Yanpeng Sun, Kaixin Guo, Lizhuo Lin, Jinrong Yang, Qi Yong H. Ai, Lun M. Wong, Hao Tang, Kuo Feng Hung
cs.AI

Abstract

Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, GPT-4o, achieves only 41.45% accuracy, revealing significant limitations of current models in this domain. To promote progress in this domain, we also propose OralGPT, which performs supervised fine-tuning (SFT) on Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs; for example, OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and can enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.
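The benchmark results above are reported as accuracy over multiple-choice questions. A minimal sketch of how such scoring might be computed is shown below; the record format and field names are hypothetical illustrations, not the actual MMOral-Bench evaluation suite (which is available in the linked repository).

```python
# Hypothetical sketch: scoring multiple-choice VQA predictions against
# gold answers, as a benchmark such as MMOral-Bench might do internally.
# The record schema ("pred", "answer") is illustrative only.

def accuracy(records):
    """Fraction of records whose predicted option letter matches the gold answer."""
    if not records:
        return 0.0
    correct = sum(
        1 for r in records
        if r["pred"].strip().upper() == r["answer"].strip().upper()
    )
    return correct / len(records)

demo = [
    {"pred": "B", "answer": "B"},   # correct
    {"pred": "c", "answer": "C"},   # correct after case normalization
    {"pred": "A", "answer": "D"},   # wrong
]
print(f"{accuracy(demo):.2%}")  # → 66.67%
```

Real evaluation harnesses additionally need to extract the option letter from free-form model output (e.g., via regex or an answer-matching prompt), which is where much of the practical complexity lies.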