Does FLUX Already Know How to Perform Physically Plausible Image Composition?

Abstract

Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and with diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to either latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces a manifold-steered anchor loss that leverages pretrained customization adapters (e.g., IP-Adapter) to guide latents toward a faithful subject representation while preserving background integrity. We further propose degradation-suppression guidance and adaptive background blending to eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, which features diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
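The abstract names adaptive background blending as the step that removes visible seams but does not say how it is computed. As a rough illustration of the general idea only, the sketch below substitutes a plain feathered-mask alpha blend: a binary insertion mask is softened with a box blur, and the softened mask mixes the generated image with the original background. The function names `feather_mask` and `blend`, the box-blur choice, and the grayscale list-of-lists image type are all assumptions made for this example; this is not SHINE's actual procedure.

```python
from typing import List

# Hypothetical image type for this sketch: grayscale rows of floats in [0, 1].
Image = List[List[float]]


def feather_mask(mask: Image, radius: int) -> Image:
    """Soften a binary mask with a box blur of side 2*radius+1.

    Pixels near the mask boundary get fractional values, so the later
    blend fades between the two images instead of cutting hard.
    """
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:  # ignore out-of-bounds
                        total += mask[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out


def blend(generated: Image, background: Image, soft_mask: Image) -> Image:
    """Per-pixel alpha blend: m * generated + (1 - m) * background."""
    return [
        [m * g + (1.0 - m) * b for g, b, m in zip(g_row, b_row, m_row)]
        for g_row, b_row, m_row in zip(generated, background, soft_mask)
    ]


if __name__ == "__main__":
    size = 7
    # Insertion region: a 3x3 square in the middle of a 7x7 image.
    mask = [[1.0 if 2 <= y <= 4 and 2 <= x <= 4 else 0.0 for x in range(size)]
            for y in range(size)]
    generated = [[1.0] * size for _ in range(size)]   # stand-in for model output
    background = [[0.0] * size for _ in range(size)]  # stand-in for original scene
    result = blend(generated, background, feather_mask(mask, radius=1))
    for row in result:
        print(" ".join(f"{v:.2f}" for v in row))
```

Far from the insertion region the output equals the original background, at the center of the region it equals the generated content, and pixels along the boundary take intermediate values. An adaptive scheme would presumably choose the mask or its softness per image rather than using a fixed radius, which this sketch does not attempt.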
