Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution
April 14, 2026
Authors: Sebastian Cajas, Ashaba Judith, Rahul Gorijavolu, Sahil Kapadia, Hillary Clinton Kasimbazi, Leo Kinyera, Emmanuel Paul Kwesiga, Sri Sri Jaithra Varma Manthena, Luis Filipe Nakayama, Ninsiima Doreen, Leo Anthony Celi
cs.AI
Abstract
Latent diffusion models for medical image super-resolution (SR) universally inherit variational autoencoders (VAEs) designed for natural photographs. We show that this default choice, not the diffusion architecture, is the dominant constraint on reconstruction quality. In a controlled experiment holding all other pipeline components fixed, replacing the generic Stable Diffusion VAE with MedVAE, a domain-specific autoencoder pretrained on more than 1.6 million medical images, improves PSNR by 2.91 to 3.29 dB across knee MRI, brain MRI, and chest X-ray (n = 1,820; Cohen's d = 1.37 to 1.86; all p < 10^{-20}, Wilcoxon signed-rank test). Wavelet decomposition localises the advantage to the finest spatial frequency bands, which encode anatomically relevant fine structure. Ablations across inference schedules, prediction targets, and generative architectures confirm that the gap is stable within ±0.15 dB, while hallucination rates remain comparable between methods (Cohen's h < 0.02 across all datasets), establishing that reconstruction fidelity and generative hallucination are governed by independent pipeline components. These results provide a practical screening criterion: autoencoder reconstruction quality, which can be measured without any diffusion training, predicts downstream SR performance (R^2 = 0.67), suggesting that domain-specific VAE selection should precede diffusion architecture search. Code and trained model weights are publicly available at https://github.com/sebasmos/latent-sr.
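For readers unfamiliar with the quantities reported above, the following is a minimal sketch of how PSNR, a paired Cohen's d, and Cohen's h are commonly computed. It is not taken from the linked repository: the function names are hypothetical, images are assumed to be scaled to [0, data_range], and the paired d_z variant of Cohen's d (mean difference divided by the standard deviation of the differences) is an assumption, since the abstract does not state which variant was used.

```python
import math

import numpy as np


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB, assuming intensities in [0, data_range]."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


def cohens_d_paired(a, b):
    """Paired effect size d_z (assumed variant): mean of differences over their sample SD."""
    diff = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    return float(diff.mean() / diff.std(ddof=1))


def cohens_h(p1, p2):
    """Effect size for the difference between two proportions (arcsine transform)."""
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return abs(phi1 - phi2)
```

In a comparison like the one described, `psnr` would be evaluated per image for each autoencoder, `cohens_d_paired` applied to the two resulting per-image PSNR arrays, and `cohens_h` applied to the two hallucination rates expressed as proportions in [0, 1].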