ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
April 10, 2026
Authors: Lifeng Chen, Tianqi You, Hao Liu, Zhimin Bao, Jile Jiao, Xiao Han, Zhicai Ou, Tao Sun, Xiaofeng Mou, Xiaojie Jin, Yi Xu
cs.AI
Abstract
Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising into a single step could further reduce latency, but it often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58% respectively, while achieving an 8× inference speedup without compromising clinical accuracy.
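The mean-field bias the abstract refers to can be illustrated with a toy example (not from the paper): a token-factorized denoiser represents only per-position marginals, so sampling positions independently places probability mass on token combinations that never co-occur in the true joint distribution. The vocabulary and probabilities below are invented purely for illustration.

```python
# Toy joint distribution over a two-token report fragment.
# All mass sits on coherent pairs; a stand-in for the joint
# token dependencies that unfactorized supervision preserves.
joint = {
    ("no", "finding"): 0.5,
    ("mild", "effusion"): 0.5,
}

def marginal(pos):
    """Per-position marginal, which is all a token-factorized
    (mean-field) denoiser can represent."""
    probs = {}
    for tokens, p in joint.items():
        probs[tokens[pos]] = probs.get(tokens[pos], 0.0) + p
    return probs

m0, m1 = marginal(0), marginal(1)

# One-step sampling from independent marginals: the product
# distribution leaks mass onto incoherent pairs.
factorized = {
    (t0, t1): p0 * p1
    for t0, p0 in m0.items()
    for t1, p1 in m1.items()
}

incoherent = {k: v for k, v in factorized.items() if k not in joint}
print(incoherent)
# Half of the probability mass lands on pairs like
# ("no", "effusion") that the joint never produces.
```

In this toy setup, each incoherent pair receives probability 0.25, so independent per-token sampling is incoherent half the time, which is the failure mode that motivates constructing unfactorized (joint) supervision rather than matching marginals alone.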