Does FLUX Already Know How to Perform Physically Plausible Image Composition?

Abstract

Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and with diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to either latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces a manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents toward faithful subject representation while preserving background integrity. We further propose degradation-suppression guidance and adaptive background blending to eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, which features diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
