Generating Illustrated Instructions

Abstract

We introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs. We identify desiderata unique to this task and formalize it through a suite of automatic and human evaluation metrics designed to measure the validity, consistency, and efficacy of the generations. We combine large language models (LLMs) with strong text-to-image diffusion models to propose a simple approach called StackedDiffusion, which generates such illustrated instructions given text as input. The resulting model strongly outperforms baseline approaches and state-of-the-art multimodal LLMs, and in 30% of cases users even prefer it to human-generated articles. Most notably, it enables various new and exciting applications far beyond what static articles on the web can provide, such as personalized instructions, complete with intermediate steps and pictures, in response to a user's individual situation.
