ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

Abstract

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but it often degrades textual coherence because of the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58%, respectively, while achieving an 8× inference speedup without compromising clinical accuracy.
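
The mean-field bias mentioned in the abstract can be illustrated with a toy example. The sketch below is not ECHO's method or code; the two-token phrases and their probabilities are made-up assumptions chosen only to show why sampling each token independently from its marginal (as a token-factorized denoiser does in a single step) can place probability on phrases that the joint distribution never produces.

```python
from itertools import product

# Hypothetical joint distribution over a two-token phrase (toy numbers,
# not taken from the paper): only two coherent phrases are possible.
joint = {
    ("pleural", "effusion"): 0.5,
    ("clear", "lungs"): 0.5,
}

def marginals(joint):
    """Per-position marginal distributions of a joint over token tuples."""
    length = len(next(iter(joint)))
    margs = [dict() for _ in range(length)]
    for tokens, p in joint.items():
        for i, tok in enumerate(tokens):
            margs[i][tok] = margs[i].get(tok, 0.0) + p
    return margs

def factorized(margs):
    """Product-of-marginals (mean-field) distribution over token tuples."""
    out = {}
    for combo in product(*(m.items() for m in margs)):
        tokens = tuple(tok for tok, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        out[tokens] = prob
    return out

mean_field = factorized(marginals(joint))

# Probability mass the factorized model assigns to phrases that the
# joint distribution never generates (e.g. "pleural lungs").
incoherent_mass = sum(p for t, p in mean_field.items() if t not in joint)

print(mean_field)
print("mass on phrases absent from the joint:", incoherent_mass)
```

In this toy setting, half of the factorized model's probability mass falls on mixed phrases such as "pleural lungs". Multi-step denoising can repair such mismatches over successive iterations, which is why collapsing to one step makes the factorization assumption matter more.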
