LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer

September 26, 2025
Authors: Song Fei, Tian Ye, Lujia Wang, Lei Zhu
cs.AI

Abstract

Universal image restoration (UIR) aims to recover images degraded by unknown mixtures of degradations while preserving semantics, conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free UIR framework that adapts a large diffusion transformer (Flux.1) without image captions. LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and from a lightly restored proxy, to anchor geometry and suppress artifacts, respectively. A timestep- and layer-adaptive modulation schedule then routes these cues across the backbone's hierarchy, yielding coarse-to-fine, context-aware updates that protect global structure while recovering texture. To avoid the latency and instability of text prompts or MLLM captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on, rather than adding parameters or relying on text prompts, is the governing lever for robust and caption-free universal image restoration in the wild.
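To make the idea of a timestep- and layer-adaptive modulation schedule concrete, the toy sketch below blends two conditioning signals with a weight that depends on the denoising timestep and the layer depth. It is not the LucidFlux implementation: the function names, the sigmoid gate, the `sharpness` parameter, and the assumption that noisy steps and shallow layers favor the degraded-input branch are all hypothetical, chosen only to illustrate the kind of routing the abstract describes.

```python
import math


def branch_weights(t, layer, num_layers, sharpness=4.0):
    """Hypothetical gate returning (w_degraded, w_proxy), which sum to 1.

    t is a normalized timestep in [0, 1], where 1.0 is the noisiest step.
    layer is a 0-based index into a backbone with num_layers blocks.
    """
    depth = layer / max(num_layers - 1, 1)  # 0.0 = shallowest, 1.0 = deepest
    # Illustrative assumption (not from the paper): noisy steps and shallow
    # layers lean on the degraded input; late steps and deep layers lean on
    # the lightly restored proxy.
    score = sharpness * ((t - 0.5) + (0.5 - depth))
    w_degraded = 1.0 / (1.0 + math.exp(-score))
    return w_degraded, 1.0 - w_degraded


def modulate(hidden, cond_degraded, cond_proxy, t, layer, num_layers):
    """Add a timestep- and layer-weighted mix of both branches to hidden."""
    w_d, w_p = branch_weights(t, layer, num_layers)
    return [h + w_d * d + w_p * p
            for h, d, p in zip(hidden, cond_degraded, cond_proxy)]


if __name__ == "__main__":
    for t in (1.0, 0.5, 0.0):
        row = [round(branch_weights(t, l, 3)[0], 3) for l in range(3)]
        print(f"t={t}: degraded-branch weight per layer = {row}")
```

In a real diffusion transformer the gate would more plausibly be learned and applied to token features rather than to plain Python lists; the fixed sigmoid here only shows how a single schedule can vary along both the time and depth axes.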