

Does FLUX Already Know How to Perform Physically Plausible Image Composition?

September 25, 2025
Authors: Shilin Lu, Zhuming Lian, Zihan Zhou, Shaocong Zhang, Chen Zhao, Adams Wai-Kin Kong
cs.AI

Abstract

Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or to brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces a manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents toward faithful subject representation while preserving background integrity. We further propose degradation-suppression guidance and adaptive background blending to eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). The code and benchmark will be made publicly available upon publication.
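To make the abstract's mechanism concrete, here is a minimal, self-contained sketch of loss-guided latent steering combined with mask-based background blending, in the spirit of what SHINE describes. This is not the authors' implementation: every name (`anchor_loss`, `guided_step`, `eps_model`, `eps_adapter`, the fixed binary mask) is a hypothetical stand-in, and the toy predictors merely substitute for a diffusion backbone evaluated with and without a customization adapter such as IP-Adapter.

```python
import torch

def anchor_loss(eps_base: torch.Tensor, eps_anchor: torch.Tensor) -> torch.Tensor:
    # Mean-squared distance between the base noise prediction and the
    # adapter-conditioned "anchor" prediction; pulling these together is
    # one plausible way to keep the inserted subject faithful.
    return torch.mean((eps_base - eps_anchor) ** 2)

def guided_step(latent, t, eps_model, eps_adapter, guidance_scale=1.0):
    # Loss-guided latent update (classifier-guidance style): differentiate
    # the anchor loss w.r.t. the latent and nudge the latent downhill.
    latent = latent.detach().requires_grad_(True)
    loss = anchor_loss(eps_model(latent, t), eps_adapter(latent, t))
    (grad,) = torch.autograd.grad(loss, latent)
    return (latent - guidance_scale * grad).detach()

@torch.no_grad()
def blend_background(latent, bg_latent, mask):
    # Background blending, simplified here to a fixed binary mask: keep the
    # generated content inside the insertion region, restore the original
    # background latent everywhere else.
    return mask * latent + (1.0 - mask) * bg_latent

if __name__ == "__main__":
    torch.manual_seed(0)
    latent = torch.randn(1, 4, 64, 64)    # current noisy latent
    bg_latent = torch.randn_like(latent)  # latent of the clean background
    mask = torch.zeros_like(latent)
    mask[..., 16:48, 16:48] = 1.0         # insertion region

    # Toy differentiable stand-ins for the two noise predictors; in practice
    # these would be the diffusion model with and without the adapter.
    eps_model = lambda z, t: 0.9 * z
    eps_adapter = lambda z, t: 0.7 * z + 0.1

    z = guided_step(latent, t=10, eps_model=eps_model,
                    eps_adapter=eps_adapter, guidance_scale=0.5)
    z = blend_background(z, bg_latent, mask)
    print(z.shape)  # torch.Size([1, 4, 64, 64])
```

In the paper's setting the blending mask would presumably be derived adaptively rather than fixed, and the anchor prediction would come from a pretrained adapter; this sketch only illustrates the guidance arithmetic, not the published method.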