Does FLUX Already Know How to Perform Physically Plausible Image Composition?
September 25, 2025
Authors: Shilin Lu, Zhuming Lian, Zihan Zhou, Shaocong Zhang, Chen Zhao, Adams Wai-Kin Kong
cs.AI
Abstract
Image composition aims to seamlessly insert a user-specified object into a
new scene, but existing models struggle with complex lighting (e.g., accurate
shadows, water reflections) and diverse, high-resolution inputs. Modern
text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential
physical and resolution priors, yet lack a framework to unleash these priors
without resorting to latent inversion, which often locks object poses into
contextually inappropriate orientations, or to brittle attention surgery. We
propose SHINE, a
training-free framework for Seamless, High-fidelity Insertion with Neutralized
Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained
customization adapters (e.g., IP-Adapter) to guide latents for faithful subject
representation while preserving background integrity. We further propose
degradation-suppression guidance and adaptive background blending to eliminate
low-quality outputs and visible seams. To address the lack of rigorous
benchmarks, we introduce ComplexCompo, featuring diverse resolutions and
challenging conditions such as low lighting, strong illumination, intricate
shadows, and reflective surfaces. Experiments on ComplexCompo and
DreamEditBench show state-of-the-art performance on standard metrics (e.g.,
DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward).
Code and benchmark will be publicly available upon publication.
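
The abstract names two training-free ingredients, latent steering against an
adapter-conditioned anchor and mask-based background blending, without giving
their exact form. The following is a minimal PyTorch sketch of how one such
guided denoising step could look, assuming a generic noise predictor
denoiser(z, t), a precomputed adapter-conditioned prediction anchor_pred, a
binary insertion mask, and re-noised background latents z_bg; all of these
names are placeholders, and the paper's actual loss and schedule may differ.

    import torch
    import torch.nn.functional as F

    def guided_step(z, t, denoiser, anchor_pred, mask, z_bg, step_size=0.05):
        # Track gradients with respect to the current latents only.
        z = z.detach().requires_grad_(True)
        eps = denoiser(z, t)  # base noise prediction at timestep t
        # Anchor-style loss (assumed form): pull the base prediction
        # toward the adapter-conditioned reference for the subject.
        loss = F.mse_loss(eps, anchor_pred)
        grad, = torch.autograd.grad(loss, z)
        z = (z - step_size * grad).detach()
        # Mask-based blending: keep guided latents inside the insertion
        # region and re-noised background latents outside it, which is
        # what preserves the background and suppresses visible seams.
        return mask * z + (1.0 - mask) * z_bg

Here the gradient step acts as per-iteration guidance, in the spirit of
classifier-guidance-style latent steering, while the blending mirrors standard
latent-space compositing.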
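
The DINOv2 score listed among the metrics is conventionally computed as cosine
similarity between global image embeddings of the reference subject and the
composited result. Below is a sketch using the official DINOv2 torch.hub
checkpoint; the file names are placeholders, and the paper's exact cropping
and evaluation protocol are not specified in the abstract.

    import torch
    import torch.nn.functional as F
    from PIL import Image
    from torchvision import transforms

    # Load the official DINOv2 ViT-S/14 backbone via torch.hub.
    model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def dino_similarity(path_a, path_b):
        # Cosine similarity between DINOv2 CLS embeddings of two images.
        a = preprocess(Image.open(path_a).convert('RGB')).unsqueeze(0)
        b = preprocess(Image.open(path_b).convert('RGB')).unsqueeze(0)
        return F.cosine_similarity(model(a), model(b)).item()

    # Placeholder paths: a reference subject image vs. a composited output.
    print(dino_similarity('reference_subject.png', 'composited_result.png'))

Higher similarity indicates that the inserted subject's identity is better
preserved under the new scene's lighting and viewpoint.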