PixelHacker: Image Inpainting with Structural and Semantic Consistency

Abstract

Image inpainting is a fundamental research area that sits between image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, and they demonstrate impressive performance. However, they often struggle with complex structure (e.g., texture, shape, spatial relations) and semantics (e.g., color consistency, object restoration, logical correctness), which leads to artifacts and inappropriate generation. To address this challenge, we design a simple yet effective inpainting paradigm called latent categories guidance, and further propose a diffusion-based model named PixelHacker. Specifically, we first construct a large dataset containing 14 million image-mask pairs by annotating foreground and background (116 and 21 potential categories, respectively). Then, we encode the potential foreground and background representations separately through two fixed-size embeddings, and intermittently inject these features into the denoising process via linear attention. Finally, by pre-training on our dataset and fine-tuning on open-source benchmarks, we obtain PixelHacker. Extensive experiments show that PixelHacker comprehensively outperforms the SOTA on a wide range of datasets (Places2, CelebA-HQ, and FFHQ) and exhibits remarkable consistency in both structure and semantics. Project page: https://hustvl.github.io/PixelHacker.
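The guidance step described in the abstract (two fixed-size embeddings injected into the denoising process via linear attention) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `elu(x) + 1` feature map (one common linear-attention formulation), the residual addition, the reuse of the embeddings as both keys and values, the `inject_every` schedule, and all function names. The real denoising blocks are omitted.

```python
import math


def feature_map(vec):
    # elu(x) + 1: a positive feature map used by one common form of
    # linear attention (assumed here; the abstract does not specify it).
    return [v + 1.0 if v > 0 else math.exp(v) for v in vec]


def linear_attention(queries, keys, values):
    """Linear attention: cost grows linearly with the number of queries.

    queries: N x d, keys: M x d, values: M x dv. Returns N x dv.
    """
    d, dv = len(keys[0]), len(values[0])
    phi_k = [feature_map(k) for k in keys]
    # Summarise keys and values once: kv[i][j] = sum_m phi(k_m)[i] * v_m[j]
    kv = [[sum(pk[i] * v[j] for pk, v in zip(phi_k, values))
           for j in range(dv)] for i in range(d)]
    k_sum = [sum(pk[i] for pk in phi_k) for i in range(d)]
    out = []
    for q in queries:
        pq = feature_map(q)
        norm = sum(a * b for a, b in zip(pq, k_sum))  # always positive
        out.append([sum(pq[i] * kv[i][j] for i in range(d)) / norm
                    for j in range(dv)])
    return out


def inject_guidance(latent_tokens, fg_embed, bg_embed):
    # Hypothetical injection: latent tokens attend to the concatenated
    # foreground/background embeddings; the result is added residually.
    guidance = fg_embed + bg_embed
    attended = linear_attention(latent_tokens, guidance, guidance)
    return [[x + a for x, a in zip(tok, att)]
            for tok, att in zip(latent_tokens, attended)]


def run_blocks(latent_tokens, fg_embed, bg_embed, num_blocks=4, inject_every=2):
    # "Intermittent" injection is modelled as every `inject_every`-th block
    # (an assumed schedule); the denoising block itself is left out.
    for block in range(num_blocks):
        if block % inject_every == 0:
            latent_tokens = inject_guidance(latent_tokens, fg_embed, bg_embed)
    return latent_tokens
```

Because the keys and values are summarised once into `kv` and `k_sum`, each latent token only pays a cost proportional to the embedding dimension rather than to the number of guidance tokens, which is the usual motivation for choosing linear attention over softmax attention.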
