CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization

Abstract

Recent advances in text-to-image personalization have enabled high-quality, controllable image synthesis for user-provided concepts. However, existing methods still struggle to balance identity preservation with text alignment. Our approach rests on the observation that generating prompt-aligned images requires a precise semantic understanding of the prompt, which in turn requires the CLIP text encoder to accurately process the interactions between the new concept and its surrounding context tokens. To address this, we aim to embed the new concept properly into the input embedding space of the text encoder, so that it integrates seamlessly with existing tokens. We introduce Context Regularization (CoRe), which improves the learning of the new concept's text embedding by regularizing its context tokens in the prompt. This is based on the insight that the text encoder can produce appropriate output vectors for the context tokens only if the new concept's text embedding is correctly learned. CoRe can be applied to arbitrary prompts without generating corresponding images, which improves the generalization of the learned text embedding. CoRe can also serve as a test-time optimization technique that further improves generation for specific prompts. Comprehensive experiments demonstrate that our method outperforms several baseline methods in both identity preservation and text alignment. Code will be made publicly available.
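The abstract does not give the form of the regularizer, so the following is only a minimal sketch of the general idea of constraining the context tokens' output vectors. It assumes a hypothetical reference prompt in which the new concept is replaced by a known word, and it assumes cosine distance as the penalty. The function names and the toy vectors are illustrative and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def context_regularization_loss(concept_outputs, reference_outputs, concept_position):
    """Mean cosine distance over context-token positions only.

    concept_outputs:   per-token encoder outputs for the prompt with the new concept
    reference_outputs: per-token outputs for the same prompt with a known word in
                       the concept's place (hypothetical reference, an assumption)
    concept_position:  index of the new-concept token, excluded from the loss
    """
    if len(concept_outputs) != len(reference_outputs):
        raise ValueError("prompts must have the same number of tokens")
    distances = [
        1.0 - cosine_similarity(u, v)
        for i, (u, v) in enumerate(zip(concept_outputs, reference_outputs))
        if i != concept_position
    ]
    if not distances:
        raise ValueError("need at least one context token")
    return sum(distances) / len(distances)


# Toy 2-D output vectors for a three-token prompt; the concept sits at position 1.
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
well_learned = [[1.0, 0.0], [0.5, 0.5], [2.0, 2.0]]     # context directions preserved
poorly_learned = [[0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]  # context directions disturbed

print(context_regularization_loss(well_learned, reference, 1))
print(context_regularization_loss(poorly_learned, reference, 1))
```

In this toy setup the loss is close to zero when the context tokens keep their reference directions and grows when they are disturbed. Because the penalty needs only text encoder outputs, a loss of this shape can be evaluated on any prompt without generating an image, which matches the property the abstract describes.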
