ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

Abstract

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency because they decode tokens sequentially. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising into a single step could further reduce latency, but it often degrades textual coherence because token-factorized denoisers introduce a mean-field bias. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58%, respectively, while achieving an 8× inference speedup without compromising clinical accuracy.
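The mean-field bias mentioned in the abstract can be illustrated with a toy example. The sketch below is not ECHO's method or code; it assumes a hypothetical two-token "report" distribution (the `joint` dictionary and the helper names `marginals` and `factorized` are made up for illustration) and shows that a sampler which treats tokens independently, matching only per-token marginals, places probability on token combinations that never occur under the joint distribution.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy joint distribution over two-token report fragments.
# Only these two phrases are valid; each is equally likely.
joint = {
    ("no", "effusion"): 0.5,
    ("small", "pneumothorax"): 0.5,
}


def marginals(joint_dist):
    """Return the per-position token marginals of a joint distribution."""
    length = len(next(iter(joint_dist)))
    margs = [defaultdict(float) for _ in range(length)]
    for seq, prob in joint_dist.items():
        for pos, tok in enumerate(seq):
            margs[pos][tok] += prob
    return [dict(m) for m in margs]


def factorized(margs):
    """Build the product-of-marginals (mean-field) distribution.

    This is the best a token-factorized predictor can represent when it
    must emit every position in one step without seeing the other tokens.
    """
    dist = {}
    for combo in product(*(m.items() for m in margs)):
        seq = tuple(tok for tok, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        dist[seq] = prob
    return dist


margs = marginals(joint)
mean_field = factorized(margs)

# Probability mass assigned to sequences the joint distribution never produces,
# e.g. ("no", "pneumothorax") or ("small", "effusion").
off_support_mass = sum(p for seq, p in mean_field.items() if seq not in joint)
print(mean_field)
print("mass on incoherent sequences:", off_support_mass)
```

In this toy setting the factorized distribution spreads its mass evenly over all four token combinations, including the two that mix fragments of different phrases. Multi-step denoising avoids this by conditioning later tokens on earlier decisions; the abstract's unfactorized supervision is described as addressing the same gap in a single step per block.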
