μ^2Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer for Radiology Report Generation

Abstract

Automated radiology report generation (RRG) aims to produce detailed textual reports from clinical imaging, such as computed tomography (CT) scans, to improve the accuracy and efficiency of diagnosis and of the provision of management advice. RRG is complicated by two key challenges: (1) the inherent complexity of extracting relevant information from imaging data under resource constraints, and (2) the difficulty of objectively evaluating discrepancies between model-generated and expert-written reports. To address these challenges, we propose μ^2LLM, a multiscale multimodal large language model for RRG tasks. The novel μ^2Tokenizer, as an intermediate layer, integrates multimodal features from the multiscale visual tokenizer and the text tokenizer; report generation quality is then enhanced through direct preference optimization (DPO), guided by GREEN-RedLlama. Experimental results on four large CT image-report medical datasets demonstrate that our method outperforms existing approaches, highlighting the potential of our μ^2LLMs fine-tuned on limited data for RRG tasks.
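To make the preference-optimization step concrete, the following is a minimal sketch of the standard DPO objective for a single preference pair, written in plain Python. It is not the authors' implementation. The helper `pick_preference_pair` and the idea that an evaluator score (for example, one produced by GREEN-RedLlama) selects the preferred report are assumptions made for illustration; the abstract does not specify how the pairs are built. The log-probabilities and scores used here are hypothetical inputs.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) report pair.

    Each argument is the summed token log-probability of a report under
    the trainable policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


def pick_preference_pair(candidates):
    """Hypothetical helper: candidates is a list of (report_text, score).

    Assumes a higher evaluator score means a better report; returns the
    best-scored report as 'chosen' and the worst-scored as 'rejected'.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[0][0], ranked[-1][0]
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2; it falls below that value as the policy raises the likelihood of the chosen report relative to the rejected one.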