Training-Free Multi-Concept LoRA Composition with Prompt-Aware Weighting

Abstract

Low-Rank Adaptation (LoRA) enables personalization in text-to-image generation by adapting pre-trained diffusion models to specific visual concepts and styles. However, extending such models to multi-concept customization remains challenging. Naively combining multiple LoRA weights or their outputs often leads to interference among concepts, which degrades visual quality and reduces fidelity to the reference images of individual concepts. This paper proposes a simple yet effective approach to multi-concept customization that optimally combines the outputs of multiple LoRA modules. We leverage the relative importance of each concept during generation, as inferred from its corresponding prompt tokens, and introduce two methods, W-Switch and W-Composite, that employ a prompt-aware importance weighting strategy: each LoRA is weighted according to the semantic influence of its trigger words in the target prompt. In addition, we extend existing quantitative evaluation metrics with a new image-based similarity evaluation framework, which assesses image fidelity and identity preservation by comparing real-world reference images with concept regions automatically segmented from the generated images. We evaluate our approach on the ComposLoRA testbed and demonstrate consistent improvements over existing state-of-the-art methods in visual quality, identity preservation, and compositionality. Qualitative evaluations, including a Large Language Model (LLM)-based assessment and a user study, further validate the effectiveness of the proposed methods and agree with the newly introduced quantitative image-based metrics. Our code is available at https://github.com/GeorgeTsoumplekas/Prompt-Aware-Multi-LoRA-Composition.
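
To make the idea of importance-weighted composition concrete, the following is a minimal illustrative sketch and not the implementation of W-Switch or W-Composite. It assumes that a scalar importance score per LoRA has already been derived from the prompt (how the scores are computed is not specified here), normalizes the scores with a softmax, and forms a weighted sum of per-LoRA output vectors. The function names `importance_weights` and `weighted_composite`, the `temperature` parameter, and the use of softmax normalization are all assumptions introduced for illustration.

```python
import math


def importance_weights(scores, temperature=1.0):
    """Softmax-normalize per-LoRA importance scores (hypothetical inputs).

    `scores` holds one scalar per LoRA; deriving them from the prompt
    tokens is outside the scope of this sketch.
    """
    if not scores:
        raise ValueError("at least one score is required")
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_composite(lora_outputs, weights):
    """Return the weighted sum of equal-length per-LoRA output vectors."""
    if len(lora_outputs) != len(weights):
        raise ValueError("need exactly one weight per LoRA output")
    dim = len(lora_outputs[0])
    if any(len(out) != dim for out in lora_outputs):
        raise ValueError("all LoRA outputs must have the same length")
    return [
        sum(w * out[i] for w, out in zip(weights, lora_outputs))
        for i in range(dim)
    ]


# Toy usage: two LoRAs with equal importance contribute equally.
weights = importance_weights([1.0, 1.0])
combined = weighted_composite([[1.0, 0.0], [0.0, 1.0]], weights)
```

In this toy setting, a LoRA whose trigger words receive a higher score contributes proportionally more to the combined output, while the weights always sum to one.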