Iterative Object Count Optimization for Text-to-Image Diffusion Models

Abstract

We address a persistent challenge in text-to-image models: accurately generating a specified number of objects. Current models, which learn from image-text pairs, inherently struggle with counting, because training data cannot depict every possible count of any given object. To solve this, we propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential. Employing an out-of-the-box counting model is challenging for two reasons: first, the model requires a scaling hyperparameter for the potential aggregation that varies with the viewpoint of the objects, and second, classifier guidance techniques require modified models that operate on noisy intermediate diffusion steps. To address these challenges, we propose an iterated online training mode that improves the accuracy of inferred images while altering the text conditioning embedding and dynamically adjusting hyperparameters. Our method offers three key advantages: (i) it can accommodate non-differentiable counting techniques based on detection models, (ii) it is a zero-shot plug-and-play solution that facilitates rapid changes to the counting techniques and image generation methods, and (iii) the optimized counting token can be reused to generate accurate images without additional optimization. We evaluate the generation of various objects and show significant improvements in accuracy. The project page is available at https://ozzafar.github.io/count_token.
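To make the idea of a counting loss driving a token update concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a hypothetical stand-in, `potentials_from_token`, that replaces the real pipeline (diffusion model plus counting model) with a small differentiable function. It also keeps the `scale` hyperparameter fixed, whereas the abstract describes adjusting it dynamically. All function names and numbers here are illustrative assumptions.

```python
import math


def soft_count(potentials, scale):
    """Aggregate a per-location potential map into a scalar count estimate."""
    return scale * sum(potentials)


def potentials_from_token(token, weights):
    """Hypothetical stand-in for 'generate an image, then run a counting model'.

    Each row of `weights` yields one potential in (0, 1) via a sigmoid.
    """
    return [
        1.0 / (1.0 + math.exp(-sum(w * t for w, t in zip(row, token))))
        for row in weights
    ]


def optimize_token(token, weights, target, scale=1.0, lr=0.05, steps=200):
    """Gradient descent on the token to minimize 0.5 * (count - target)^2."""
    token = list(token)
    for _ in range(steps):
        p = potentials_from_token(token, weights)
        err = soft_count(p, scale) - target
        # Chain rule through the sigmoid: d p_i / d token_j = p_i (1 - p_i) w_ij
        grad = [
            err * scale * sum(pi * (1.0 - pi) * row[j] for pi, row in zip(p, weights))
            for j in range(len(token))
        ]
        token = [t - lr * g for t, g in zip(token, grad)]
    return token


# Toy setup: six "locations", a two-dimensional token, and a target count of 4.
toy_weights = [[1.0, 0.5], [0.5, 1.0], [1.0, 1.0], [0.8, 0.2], [0.2, 0.8], [0.6, 0.6]]
learned = optimize_token([0.0, 0.0], toy_weights, target=4.0)
final_count = soft_count(potentials_from_token(learned, toy_weights), 1.0)
```

In this toy, only the token changes while the "model" (`toy_weights`) stays frozen, which mirrors the reuse property claimed in advantage (iii): once learned, the token can be fed to the same frozen model again without further optimization.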
