Make It Count: Text-to-Image Generation with an Accurate Number of Objects

Abstract

Despite the unprecedented success of text-to-image diffusion models, controlling the number of depicted objects using text is surprisingly hard. This matters for a range of applications, from technical documents to children's books to cooking-recipe illustrations. Generating the correct number of objects is fundamentally challenging because the generative model must keep a sense of separate identity for every instance of the object, even if several objects look identical or overlap, and then carry out a global computation implicitly during generation. It is still unknown whether such representations exist. To address count-correct generation, we first identify features within the diffusion model that can carry object identity information. We then use them to separate and count object instances during the denoising process and to detect over-generation and under-generation. We correct under-generation by training a model that predicts both the shape and location of a missing object based on the layout of the existing ones, and we show how it can be used to guide denoising toward the correct object count. Our approach, CountGen, does not depend on an external source to determine object layout; instead, it uses the prior from the diffusion model itself, creating prompt-dependent and seed-dependent layouts. On two benchmark datasets, CountGen strongly outperforms existing baselines in count accuracy.
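To make the "separate, count, and detect" step concrete, here is a minimal sketch under strong assumptions. It is not the CountGen method: the abstract separates instances using identity features found inside the diffusion model, whereas this sketch assumes a binary foreground mask is already available and simply counts 4-connected regions. The names count_instances and classify_count are hypothetical, introduced only for illustration.

```python
from collections import deque


def count_instances(mask):
    """Count 4-connected foreground regions in a 2D grid of 0/1 values.

    Assumption: `mask` is a list of equal-length rows marking object pixels.
    This is a stand-in for instance separation, not the paper's technique.
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                # New region found: flood-fill it so it is counted once.
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def classify_count(found, requested):
    """Label the outcome using the abstract's terms."""
    if found > requested:
        return "over-generation"
    if found < requested:
        return "under-generation"
    return "correct"


# Two separate blobs, but the (hypothetical) prompt asked for three objects.
mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
found = count_instances(mask)
print(found, classify_count(found, requested=3))
```

A connected-component count merges touching or overlapping objects into one region, which is exactly the failure case the abstract highlights and the reason it relies on per-instance identity features instead.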
