YoChameleon: Personalized Vision and Language Generation

April 29, 2025
作者: Thao Nguyen, Krishna Kumar Singh, Jing Shi, Trung Bui, Yong Jae Lee, Yuheng Li
cs.AI

Abstract

Large Multimodal Models (e.g., GPT-4, Gemini, Chameleon) have evolved into powerful tools with millions of users. However, they remain generic models and lack personalized knowledge of specific user concepts. Previous work has explored personalization for text generation, yet it remains unclear how these methods can be adapted to new modalities, such as image generation. In this paper, we introduce Yo'Chameleon, the first attempt to study personalization for large multimodal models. Given 3-5 images of a particular concept, Yo'Chameleon leverages soft-prompt tuning to embed subject-specific information to (i) answer questions about the subject and (ii) recreate pixel-level details to produce images of the subject in new contexts. Yo'Chameleon is trained with (i) a self-prompting optimization mechanism to balance performance across multiple modalities, and (ii) a "soft-positive" image generation approach to enhance image quality in a few-shot setting.
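The core mechanism the abstract mentions, soft-prompt tuning, prepends a small set of learnable embedding vectors to the model's input while the model's own weights stay frozen. The toy sketch below illustrates only that idea; all dimensions and names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy dimensions (assumptions, not from the paper):
# a vocabulary of 100 tokens, 16-dim embeddings, 4 soft-prompt vectors.
VOCAB, DIM, N_SOFT = 100, 16, 4

token_embedding = rng.normal(size=(VOCAB, DIM))       # frozen model weights
soft_prompt = rng.normal(size=(N_SOFT, DIM)) * 0.01   # the only trainable part

def embed_with_soft_prompt(token_ids):
    """Prepend the learnable soft-prompt vectors to the frozen token
    embeddings, forming the sequence the multimodal model would consume."""
    tokens = token_embedding[token_ids]               # (seq_len, DIM)
    return np.concatenate([soft_prompt, tokens], axis=0)

seq = embed_with_soft_prompt(np.array([5, 17, 42]))
print(seq.shape)  # (7, 16): 4 soft-prompt slots + 3 token embeddings
```

During training, gradients would update only `soft_prompt`, so the subject-specific knowledge from the 3-5 reference images is compressed into those few vectors.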

