LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer

Abstract

Universal image restoration (UIR) aims to recover images degraded by unknown mixtures of degradations while preserving semantics, conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free UIR framework that adapts a large diffusion transformer (Flux.1) without image captions. LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and from a lightly restored proxy, which respectively anchor geometry and suppress artifacts. A timestep- and layer-adaptive modulation schedule routes these cues across the backbone's hierarchy, yielding coarse-to-fine, context-aware updates that protect global structure while recovering texture. To avoid the latency and instability of text prompts or MLLM captions, LucidFlux enforces caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. These results indicate that, for large DiTs, when, where, and what to condition on, rather than adding parameters or relying on text prompts, is the governing lever for robust, caption-free universal image restoration in the wild.
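To make the idea of timestep- and layer-adaptive routing concrete, the following toy sketch blends two conditioning branches with a gate that depends on the diffusion timestep and the layer depth. It is not the LucidFlux implementation: the function names (`branch_gates`, `inject`), the fixed sigmoid schedule, the `sharpness` parameter, and the direction of the gate (more weight on the degraded-input branch at noisy steps and in shallow layers) are all assumptions made for illustration. The abstract does not specify the functional form, and a real system would presumably learn the modulation rather than hard-code it.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def branch_gates(t: float, layer: int, num_layers: int, sharpness: float = 4.0):
    """Hypothetical schedule returning (gate_degraded, gate_proxy).

    t is a normalised timestep in [0, 1], with 1 meaning the noisiest step.
    ASSUMPTION: noisy steps and shallow layers favour the degraded-input
    branch; clean steps and deep layers favour the restored-proxy branch.
    """
    depth = layer / max(num_layers - 1, 1)  # 0.0 = first layer, 1.0 = last
    score = sharpness * ((t - 0.5) + (0.5 - depth))
    gate_degraded = sigmoid(score)
    return gate_degraded, 1.0 - gate_degraded


def inject(hidden, degraded_feat, proxy_feat, t, layer, num_layers):
    """Add the gated sum of both branch features to a layer's hidden state."""
    g_d, g_p = branch_gates(t, layer, num_layers)
    return [h + g_d * d + g_p * p
            for h, d, p in zip(hidden, degraded_feat, proxy_feat)]


if __name__ == "__main__":
    # Print the assumed gate pattern for a 4-layer toy backbone.
    for t in (1.0, 0.5, 0.0):
        gates = [round(branch_gates(t, l, 4)[0], 3) for l in range(4)]
        print(f"t={t}: degraded-branch gate per layer = {gates}")
```

In this toy schedule the two gates always sum to one, so the conditioner redistributes a fixed budget between the branches instead of adding signal on top; that constraint is also a design choice of the sketch, not something stated in the abstract.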
