Iterative Object Count Optimization for Text-to-image Diffusion Models

Abstract

We address a persistent challenge in text-to-image models: accurately generating a specified number of objects. Current models, which learn from image-text pairs, inherently struggle with counting, as training data cannot depict every possible number of objects for any given object. To solve this, we propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential. Employing an out-of-the-box counting model is challenging for two reasons: first, the model requires a scaling hyperparameter for the potential aggregation that varies depending on the viewpoint of the objects, and second, classifier guidance techniques require modified models that operate on noisy intermediate diffusion steps. To address these challenges, we propose an iterated online training mode that improves the accuracy of inferred images while altering the text conditioning embedding and dynamically adjusting hyperparameters. Our method offers three key advantages: (i) it can consider non-derivable counting techniques based on detection models, (ii) it is a zero-shot plug-and-play solution facilitating rapid changes to the counting techniques and image generation methods, and (iii) the optimized counting token can be reused to generate accurate images without additional optimization. We evaluate the generation of various objects and show significant improvements in accuracy. The project page is available at https://ozzafar.github.io/count_token.
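The loop described in the abstract (generate, count, update the conditioning token, recalibrate the scaling hyperparameter) can be illustrated with a deliberately tiny toy. This is a sketch under strong assumptions, not the authors' implementation: `toy_generator` is a hypothetical stand-in for the diffusion model, `detector_count` is a hypothetical non-differentiable detector, and the learning rate, threshold, and initial token are all invented for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_generator(token):
    """Hypothetical stand-in for the image generator: one object potential per slot."""
    return [sigmoid(t) for t in token]


def detector_count(potentials, threshold=0.5):
    """Hypothetical non-differentiable counter: slots whose potential exceeds a threshold."""
    return sum(1 for p in potentials if p > threshold)


def optimize_count_token(token, target, lr=0.1, max_steps=500):
    """Iteratively update the token until the detector reports the target count."""
    token = list(token)
    scale = 1.0  # scaling hyperparameter for the potential aggregation
    for _ in range(max_steps):
        potentials = toy_generator(token)
        detected = detector_count(potentials)
        if detected == target:
            break
        total = sum(potentials)
        if detected > 0:
            # Dynamic adjustment: make the scaled potential sum agree with the detector.
            scale = detected / total
        error = scale * total - target
        # Gradient step on the squared counting loss, holding scale constant.
        for i, p in enumerate(potentials):
            grad = 2.0 * error * scale * p * (1.0 - p)
            token[i] -= lr * grad
    return token


initial_token = [-1.1 + 0.2 * i for i in range(10)]
optimized_token = optimize_count_token(initial_token, target=7)

# The optimized token can be reused: regenerating from it needs no further optimization.
final_count = detector_count(toy_generator(optimized_token))
print(final_count)
```

In this toy the detector is used only to decide when to stop and to recalibrate the scale, while the gradient comes from the differentiable potential sum. That mirrors the division of labor the abstract describes, at a vastly reduced scale.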

