Iterative Object Count Optimization for Text-to-image Diffusion Models

Abstract

We address a persistent challenge in text-to-image models: accurately generating a specified number of objects. Current models, which learn from image-text pairs, inherently struggle with counting, because training data cannot depict every possible count for every object. To solve this, we propose optimizing the generated image with a counting loss derived from a counting model that aggregates an object's potential. Employing an out-of-the-box counting model is challenging for two reasons: first, the model requires a scaling hyperparameter for the potential aggregation that varies with the viewpoint of the objects, and second, classifier guidance techniques require modified models that operate on noisy intermediate diffusion steps. To address these challenges, we propose an iterated online training mode that improves the accuracy of inferred images while altering the text conditioning embedding and dynamically adjusting hyperparameters. Our method offers three key advantages: (i) it can consider non-derivable counting techniques based on detection models, (ii) it is a zero-shot plug-and-play solution that allows the counting technique and the image generation method to be swapped quickly, and (iii) the optimized counting token can be reused to generate accurate images without additional optimization. We evaluate the generation of various objects and show significant improvements in accuracy. The project page is available at https://ozzafar.github.io/count_token.
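The core idea of optimizing a conditioning token against a counting loss can be illustrated with a toy sketch. This is not the authors' implementation: the image generator and counting model are replaced by a hypothetical sigmoid "potential map", the token is a single scalar, the loss is assumed to be a squared error, and the scale is held fixed rather than adjusted dynamically. All function names below are invented for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def potential_map(token, biases):
    # Hypothetical stand-in for "generate an image, then run a counting model":
    # each cell's object potential depends on the scalar token.
    return [sigmoid(token + b) for b in biases]


def soft_count(potentials, scale):
    # Aggregate the potentials into a count estimate; `scale` plays the role
    # of the scaling hyperparameter mentioned in the abstract.
    return sum(potentials) / scale


def optimize_token(target, biases, scale=1.0, lr=0.1, steps=300):
    # Gradient descent on the token for the assumed loss (count - target)^2.
    token = 0.0
    for _ in range(steps):
        p = potential_map(token, biases)
        err = soft_count(p, scale) - target
        # Analytic gradient: d(sigmoid)/dx = p * (1 - p).
        grad = 2.0 * err * sum(q * (1.0 - q) for q in p) / scale
        token -= lr * grad
    return token


BIASES = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]  # arbitrary toy values
token = optimize_token(target=2.0, biases=BIASES)
final_count = soft_count(potential_map(token, BIASES), 1.0)
print(round(final_count, 3))
```

In this toy setting the optimized `token` can be reused to reproduce the target count without further optimization, which mirrors advantage (iii) only in spirit.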
