StyleDrop: Text-to-Image Generation in Any Style

Abstract

Pre-trained large text-to-image models synthesize impressive images when given appropriate text prompts. However, ambiguities inherent in natural language and out-of-distribution effects make it hard to synthesize image styles that leverage a specific design pattern, texture, or material. In this paper, we introduce StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. The proposed method is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. It efficiently learns a new style by fine-tuning very few trainable parameters (less than 1% of total model parameters) and improves quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a single image that specifies the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop implemented on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion. More results are available at our project website: https://styledrop.github.io
