μ^2Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer for Radiology Report Generation
June 30, 2025
Authors: Siyou Li, Pengyao Qin, Huanan Wu, Dong Nie, Arun J. Thirunavukarasu, Juntao Yu, Le Zhang
cs.AI
Abstract
Automated radiology report generation (RRG) aims to produce detailed textual reports from clinical imaging, such as computed tomography (CT) scans, to improve the accuracy and efficiency of diagnosis and the provision of management advice. RRG is complicated by two key challenges: (1) the inherent complexity of extracting relevant information from imaging data under resource constraints, and (2) the difficulty of objectively evaluating discrepancies between model-generated and expert-written reports. To address these challenges, we propose μ^2LLM, a **mu**lti-scale **mu**lti-modal large language model for RRG tasks. The novel μ^2Tokenizer, as an intermediate layer, integrates multi-modal features from the multi-scale visual tokenizer and the text tokenizer, then enhances report generation quality through direct preference optimization (DPO), guided by GREEN-RedLlama. Experimental results on four large CT image-report medical datasets demonstrate that our method outperforms existing approaches, highlighting the potential of fine-tuned μ^2LLMs for RRG tasks with limited data.
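
For illustration, the sketch below shows one way a multi-scale visual tokenizer and a text tokenizer could be fused into joint multimodal tokens before they are fed to an LLM, in the spirit of the μ^2Tokenizer described above. The module name (`MultiScaleFusionTokenizer`), the cross-attention fusion scheme, and all dimensions are assumptions made for this sketch, not the authors' implementation.

```python
# Illustrative sketch only: a hypothetical multi-scale visual/text token
# fusion module. Names and the fusion scheme are assumptions, not the
# paper's μ^2Tokenizer implementation.
import torch
import torch.nn as nn


class MultiScaleFusionTokenizer(nn.Module):
    def __init__(self, embed_dim: int = 768, num_heads: int = 8, num_scales: int = 3):
        super().__init__()
        # One linear projection per visual scale (e.g. coarse-to-fine CT feature maps).
        self.scale_projs = nn.ModuleList(
            [nn.Linear(embed_dim, embed_dim) for _ in range(num_scales)]
        )
        # Text tokens attend to the concatenated multi-scale visual tokens.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, visual_tokens_per_scale, text_tokens):
        # visual_tokens_per_scale: list of (B, N_s, D) tensors, one per scale
        # text_tokens: (B, T, D) embeddings from the text tokenizer
        fused_visual = torch.cat(
            [proj(v) for proj, v in zip(self.scale_projs, visual_tokens_per_scale)],
            dim=1,
        )  # (B, sum(N_s), D)
        attended, _ = self.cross_attn(text_tokens, fused_visual, fused_visual)
        # Residual connection keeps the original text information intact.
        return self.norm(text_tokens + attended)  # multimodal tokens for the LLM


if __name__ == "__main__":
    tok = MultiScaleFusionTokenizer()
    scales = [torch.randn(2, n, 768) for n in (64, 32, 16)]  # three visual scales
    text = torch.randn(2, 20, 768)
    print(tok(scales, text).shape)  # torch.Size([2, 20, 768])
```

In a full pipeline of this kind, the resulting multimodal tokens would condition the LLM, and candidate reports could then be scored (e.g., by a GREEN-RedLlama-based evaluator) to build preference pairs for DPO; that training stage is omitted from this sketch.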