需要多少幅梵高画作才能达到梵高的水平?寻找模仿的阈值
How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold
October 19, 2024
作者: Sahil Verma, Royi Rassin, Arnav Das, Gantavya Bhatt, Preethi Seshadri, Chirag Shah, Jeff Bilmes, Hannaneh Hajishirzi, Yanai Elazar
cs.AI
摘要
文本到图像模型是使用从互联网上爬取的图像-文本对组成的大型数据集进行训练的。这些数据集通常包括私人、受版权保护和许可的材料。在这些数据集上训练模型使其能够生成具有此类内容的图像,这可能违反版权法和个人隐私。这种现象被称为模仿 —— 生成具有与其训练图像相似度可识别的内容的图像。在这项工作中,我们研究了概念在训练数据集中的频率与模型模仿该概念的能力之间的关系。我们试图确定模型在训练了足够多实例以模仿一个概念的点 —— 模仿阈值。我们将这个问题提出为一个新问题:寻找模仿阈值(FIT),并提出了一种高效的方法,可以估计模仿阈值,而无需耗费大量成本从头开始训练多个模型。我们在两个领域 —— 人脸和艺术风格 —— 进行实验,创建了四个数据集,并评估了在两个预训练数据集上训练的三个文本到图像模型。我们的结果显示,这些模型的模仿阈值在200-600张图像的范围内,取决于领域和模型。模仿阈值可以为版权侵权索赔提供经验依据,并作为遵守版权和隐私法律的文本到图像模型开发者的指导原则。我们在https://github.com/vsahil/MIMETIC-2.git发布了代码和数据,项目网站托管在https://how-many-van-goghs-does-it-take.github.io。
English
Text-to-image models are trained using large datasets collected by scraping
image-text pairs from the internet. These datasets often include private,
copyrighted, and licensed material. Training models on such datasets enables
them to generate images with such content, which might violate copyright laws
and individual privacy. This phenomenon is termed imitation -- generation of
images with content that has recognizable similarity to its training images. In
this work we study the relationship between a concept's frequency in the
training dataset and the ability of a model to imitate it. We seek to determine
the point at which a model was trained on enough instances to imitate a concept
-- the imitation threshold. We posit this question as a new problem: Finding
the Imitation Threshold (FIT) and propose an efficient approach that estimates
the imitation threshold without incurring the colossal cost of training
multiple models from scratch. We experiment with two domains -- human faces and
art styles -- for which we create four datasets, and evaluate three
text-to-image models which were trained on two pretraining datasets. Our
results reveal that the imitation threshold of these models is in the range of
200-600 images, depending on the domain and the model. The imitation threshold
can provide an empirical basis for copyright violation claims and acts as a
guiding principle for text-to-image model developers that aim to comply with
copyright and privacy laws. We release the code and data at
https://github.com/vsahil/MIMETIC-2.git and the project's website is
hosted at https://how-many-van-goghs-does-it-take.github.io.Summary
AI-Generated Summary