Training-Free Multi-Concept LoRA Composition with Prompt-Aware Weighting

Abstract

Low-Rank Adaptation (LoRA) enables personalization in text-to-image generation by adapting pre-trained diffusion models to specific visual concepts and styles. However, extending such models to multi-concept customization remains challenging. Naively combining multiple LoRA weights or their outputs often leads to interference among concepts, resulting in degraded visual quality and reduced fidelity to the reference images of individual concepts. This paper proposes a simple yet effective approach to multi-concept customization that optimally combines the outputs of multiple LoRA modules. We leverage the relative importance of each concept during generation, as inferred from its corresponding prompt tokens. We introduce two methods, W-Switch and W-Composite, that employ a prompt-aware importance weighting strategy in which each LoRA is weighted according to the semantic influence of its trigger words in the target prompt. In addition, we extend existing quantitative evaluation metrics by proposing a new image-based similarity evaluation framework that assesses image fidelity and identity preservation by comparing real-world reference images with automatically segmented concept regions from generated images. We evaluate our approach on the ComposLoRA testbed and demonstrate consistent improvements over existing state-of-the-art methods in terms of visual quality, identity preservation, and compositionality. Qualitative evaluations, including a Large Language Model (LLM)-based assessment and a user study, further validate the effectiveness of the proposed methods and agree with the newly introduced quantitative image-based metrics. Our code is available at https://github.com/GeorgeTsoumplekas/Prompt-Aware-Multi-LoRA-Composition.
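
To make the idea of importance-weighted output combination concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a non-negative importance score per concept has already been obtained from the prompt (how that score is computed is not specified in the abstract), normalizes the scores into weights, and forms a weighted sum of per-LoRA output vectors. The function names `normalize_importance` and `combine_outputs`, the concept names, and all numeric values are hypothetical.

```python
from typing import Dict, List


def normalize_importance(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn non-negative per-concept importance scores into weights summing to 1."""
    if not scores:
        raise ValueError("at least one concept is required")
    if any(s < 0 for s in scores.values()):
        raise ValueError("importance scores must be non-negative")
    total = sum(scores.values())
    if total == 0:
        # No usable signal: fall back to uniform weights (an assumption of this sketch).
        return {name: 1.0 / len(scores) for name in scores}
    return {name: s / total for name, s in scores.items()}


def combine_outputs(
    outputs: Dict[str, List[float]], weights: Dict[str, float]
) -> List[float]:
    """Weighted sum of per-LoRA output vectors, which must share one length."""
    lengths = {len(vec) for vec in outputs.values()}
    if len(lengths) != 1:
        raise ValueError("all outputs must be non-empty and have the same length")
    combined = [0.0] * lengths.pop()
    for name, vec in outputs.items():
        w = weights[name]
        for i, x in enumerate(vec):
            combined[i] += w * x
    return combined


# Hypothetical importance scores for two concepts in one prompt.
scores = {"character": 3.0, "style": 1.0}
weights = normalize_importance(scores)

# Toy stand-ins for the outputs produced with each LoRA active.
outputs = {"character": [1.0, 0.0], "style": [0.0, 1.0]}
combined = combine_outputs(outputs, weights)
```

In this toy setup the concept with the higher score contributes proportionally more to the combined output, which is the general behavior a prompt-aware weighting is meant to achieve. In a real diffusion pipeline the vectors would be tensors such as per-step noise predictions.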