YoChameleon: Personalized Vision and Language Generation

April 29, 2025
Authors: Thao Nguyen, Krishna Kumar Singh, Jing Shi, Trung Bui, Yong Jae Lee, Yuheng Li
cs.AI

Abstract

Large Multimodal Models (e.g., GPT-4, Gemini, Chameleon) have evolved into powerful tools with millions of users. However, they remain generic models and lack personalized knowledge of specific user concepts. Previous work has explored personalization for text generation, yet it remains unclear how these methods can be adapted to new modalities, such as image generation. In this paper, we introduce Yo'Chameleon, the first attempt to study personalization for large multimodal models. Given 3-5 images of a particular concept, Yo'Chameleon leverages soft-prompt tuning to embed subject-specific information to (i) answer questions about the subject and (ii) recreate pixel-level details to produce images of the subject in new contexts. Yo'Chameleon is trained with (i) a self-prompting optimization mechanism to balance performance across multiple modalities, and (ii) a "soft-positive" image generation approach to enhance image quality in a few-shot setting.
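The soft-prompt tuning the abstract refers to can be illustrated with a minimal sketch: a small frozen backbone stands in for the multimodal model, and only a few learnable prompt embeddings prepended to the input are optimized. This is not the authors' implementation; the toy model, dimensions, and random data below are placeholder assumptions.

```python
import torch
import torch.nn as nn

VOCAB, DIM, N_SOFT = 100, 32, 4  # toy sizes, not from the paper

class ToyLM(nn.Module):
    """Stand-in for a frozen multimodal LM backbone."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.backbone = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, token_ids, soft_prompt):
        x = self.embed(token_ids)                              # (B, T, DIM)
        # Prepend the learnable soft-prompt embeddings to the sequence.
        x = torch.cat([soft_prompt.expand(x.size(0), -1, -1), x], dim=1)
        return self.head(self.backbone(x))                     # (B, N_SOFT+T, VOCAB)

model = ToyLM()
for p in model.parameters():
    p.requires_grad_(False)                                    # backbone stays frozen

# The only trainable parameters: subject-specific soft-prompt vectors.
soft_prompt = nn.Parameter(torch.randn(1, N_SOFT, DIM) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=1e-2)

tokens = torch.randint(0, VOCAB, (2, 8))                       # stand-in token data
targets = torch.randint(0, VOCAB, (2, 8 + N_SOFT))
for _ in range(3):
    logits = model(tokens, soft_prompt)
    loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB),
                                       targets.reshape(-1))
    opt.zero_grad()
    loss.backward()                                            # gradients flow only to soft_prompt
    opt.step()
```

The design point is that personalization cost is a handful of embedding vectors per subject, while the shared backbone weights are untouched.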
