gen2seg: Generative Models Enable Generalizable Instance Segmentation

Abstract

By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general-purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category-agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero-shot generalization, accurately segmenting objects of types and styles unseen in finetuning (and in many cases, MAE's ImageNet-1K pretraining too). Our best-performing models closely approach the heavily supervised SAM when evaluated on unseen object types and styles, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining. Code, pretrained models, and demos are available on our website.
