LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer
September 26, 2025
Authors: Song Fei, Tian Ye, Lujia Wang, Lei Zhu
cs.AI
Abstract
Universal image restoration (UIR) aims to recover images corrupted by unknown
mixtures of degradations while preserving semantics, conditions under which
discriminative restorers and UNet-based diffusion priors often oversmooth,
hallucinate, or drift. We present LucidFlux, a caption-free UIR framework that
adapts a large diffusion transformer (Flux.1). LucidFlux introduces a
lightweight dual-branch conditioner that injects signals from the degraded
input and a lightly restored proxy to respectively anchor geometry and suppress
artifacts. A timestep- and layer-adaptive modulation schedule then routes
these cues across the backbone's hierarchy, yielding coarse-to-fine,
context-aware updates that protect global structure while recovering
texture. Furthermore, to avoid the latency and instability of
text prompts or MLLM captions, we enforce caption-free semantic alignment via
SigLIP features extracted from the proxy. A scalable curation pipeline further
filters large-scale data for structure-rich supervision. Across synthetic and
in-the-wild benchmarks, LucidFlux consistently outperforms strong open-source
and commercial baselines, and ablation studies verify the necessity of each
component. LucidFlux shows that, for large DiTs, when, where, and what to
condition on (rather than adding parameters or relying on text prompts) is
the governing lever for robust, caption-free universal image restoration in
the wild.
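
As a rough illustration of what a timestep- and layer-adaptive schedule could look like, the toy sketch below blends two condition signals (one from the degraded input, one from the lightly restored proxy) with a weight that depends on both the normalized timestep and the layer depth. The abstract does not specify the actual schedule, so the sigmoid gate, the `sharpness` parameter, and the names `modulation_weights` and `fuse` are all assumptions made for illustration, not the authors' formulation.

```python
import math


def modulation_weights(t: float, layer: int, num_layers: int,
                       sharpness: float = 4.0) -> tuple[float, float]:
    """Toy gate returning (w_degraded, w_proxy), which always sum to 1.

    t is a normalized timestep in [0, 1]; layer is a 0-based block index.
    The sigmoid form and `sharpness` are illustrative assumptions only.
    """
    depth = layer / max(num_layers - 1, 1)  # normalized depth in [0, 1]
    gate = 1.0 / (1.0 + math.exp(-sharpness * (t - depth)))
    return 1.0 - gate, gate


def fuse(degraded_feat, proxy_feat, t, layer, num_layers):
    """Blend two equally sized feature vectors with the toy gate."""
    w_deg, w_proxy = modulation_weights(t, layer, num_layers)
    return [w_deg * d + w_proxy * p for d, p in zip(degraded_feat, proxy_feat)]


if __name__ == "__main__":
    # Hypothetical features: the same inputs are weighted differently per layer.
    for layer in range(3):
        print(layer, fuse([1.0, 2.0], [3.0, 4.0], t=0.5, layer=layer, num_layers=3))
```

The point of the sketch is only that the same pair of cues can be mixed differently at each layer and timestep; a learned schedule would replace the hand-written gate.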