Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis

Abstract

Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, are difficult to interpret because of their dense anatomical structures and subtle pathological cues, and existing medical benchmarks and instruction datasets do not capture these challenges. To address this gap, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, GPT-4o, achieves only 41.45% accuracy, revealing significant limitations of current models in this domain. To promote progress in this domain, we also propose OralGPT, which applies supervised fine-tuning (SFT) to Qwen2.5-VL-7B using our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance gains for LVLMs; for example, OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and could enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.
