Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution
April 14, 2026
Authors: Sebastian Cajas, Ashaba Judith, Rahul Gorijavolu, Sahil Kapadia, Hillary Clinton Kasimbazi, Leo Kinyera, Emmanuel Paul Kwesiga, Sri Sri Jaithra Varma Manthena, Luis Filipe Nakayama, Ninsiima Doreen, Leo Anthony Celi
cs.AI
Abstract
Latent diffusion models for medical image super-resolution universally inherit variational autoencoders (VAEs) designed for natural photographs. We show that this default choice, not the diffusion architecture, is the dominant constraint on reconstruction quality. In a controlled experiment holding all other pipeline components fixed, replacing the generic Stable Diffusion VAE with MedVAE, a domain-specific autoencoder pretrained on more than 1.6 million medical images, yields a PSNR improvement of +2.91 to +3.29 dB across knee MRI, brain MRI, and chest X-ray (n = 1,820; Cohen's d = 1.37 to 1.86; all p < 10^{-20}, Wilcoxon signed-rank test). Wavelet decomposition localises the advantage to the highest spatial-frequency bands, which encode anatomically relevant fine structure. Ablations across inference schedules, prediction targets, and generative architectures confirm that the gap is stable to within ±0.15 dB, while hallucination rates remain comparable between methods (Cohen's h < 0.02 across all datasets), establishing that reconstruction fidelity and generative hallucination are governed by independent pipeline components. These results provide a practical screening criterion: autoencoder reconstruction quality, measurable without diffusion training, predicts downstream super-resolution (SR) performance (R^2 = 0.67), suggesting that domain-specific VAE selection should precede diffusion architecture search. Code and trained model weights are publicly available at https://github.com/sebasmos/latent-sr.
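For readers less familiar with the metrics quoted above, the following is a minimal sketch of how PSNR and the two effect sizes are conventionally defined. It is not taken from the authors' repository. The function names, the flat pixel lists, the `data_range` default of 1.0, and the choice of the paired-difference form of Cohen's d are all assumptions made for illustration; the abstract does not state which variant of Cohen's d was used.

```python
import math
from statistics import mean, stdev


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    data_range is the assumed maximum possible pixel value (1.0 for images
    scaled to [0, 1]); this default is an illustrative assumption.
    """
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    mse = mean((r - e) ** 2 for r, e in zip(reference, estimate))
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


def cohens_d_paired(scores_a, scores_b):
    """Cohen's d for paired samples: mean difference over the SD of differences.

    Assumed variant; other definitions (e.g. pooled SD) give different values.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / stdev(diffs)


def cohens_h(p1, p2):
    """Cohen's h: effect size for the difference between two proportions."""
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


if __name__ == "__main__":
    # Toy data only: a constant error of 0.1 on a [0, 1] image.
    print(psnr([0.0] * 4, [0.1] * 4))
    # Halving the MSE raises PSNR by 10 * log10(2) dB.
    print(10.0 * math.log10(2.0))
```

Because PSNR is logarithmic in mean squared error, a gain of 10·log10(2) ≈ 3.01 dB corresponds to halving the MSE, so the reported +2.91 to +3.29 dB range is roughly a twofold reduction in squared reconstruction error.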