
Rich Human Feedback for Text-to-Image Generation

December 15, 2023
作者: Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, Junjie Ke, Krishnamurthy Dj Dvijotham, Katie Collins, Yiwen Luo, Yang Li, Kai J Kohlhoff, Deepak Ramachandran, Vidhya Navalpakkam
cs.AI

Abstract

Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models, prior works collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which words in the text prompt are misrepresented or missing on the image. We collect such rich human feedback on 18K generated images and train a multimodal transformer to predict the rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants).
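The inpainting use described above relies on turning a predicted artifact heatmap into a binary mask. A minimal sketch of that step, assuming a heatmap normalized to [0, 1]; the `threshold` and `pad` values, and the function name itself, are illustrative assumptions rather than details from the paper:

```python
import numpy as np

def heatmap_to_mask(heatmap: np.ndarray, threshold: float = 0.5, pad: int = 1) -> np.ndarray:
    """Threshold a predicted artifact heatmap into a binary inpainting mask.

    `threshold` and `pad` are illustrative choices, not values from the paper.
    """
    mask = heatmap >= threshold
    # Grow the mask by `pad` pixels so the inpainter also covers region borders.
    # Implemented as a max-filter via shifted copies; note np.roll wraps at the
    # image edges, which is acceptable for this sketch.
    grown = mask.copy()
    for dy in range(-pad, pad + 1):
        for dx in range(-pad, pad + 1):
            grown |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return grown.astype(np.uint8)
```

The resulting 0/1 mask can then be passed to any off-the-shelf inpainting model to regenerate only the flagged regions while leaving the rest of the image untouched.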