VLM-Guided Adaptive Negative Prompting for Creative Generation
October 12, 2025
Authors: Shelly Golan, Yotam Nitzan, Zongze Wu, Or Patashnik
cs.AI
Abstract
Creative generation is the synthesis of new, surprising, and valuable samples
that reflect user intent yet cannot be envisioned in advance. This task aims to
extend human imagination, enabling the discovery of visual concepts that exist
in the unexplored spaces between familiar domains. While text-to-image
diffusion models excel at rendering photorealistic scenes that faithfully match
user prompts, they still struggle to generate genuinely novel content. Existing
approaches to enhance generative creativity either rely on interpolation of
image features, which restricts exploration to predefined categories, or
require time-intensive procedures such as embedding optimization or model
fine-tuning. We propose VLM-Guided Adaptive Negative-Prompting, a
training-free, inference-time method that promotes creative image generation
while preserving the validity of the generated object. Our approach utilizes a
vision-language model (VLM) that analyzes intermediate outputs of the
generation process and adaptively steers it away from conventional visual
concepts, encouraging the emergence of novel and surprising outputs. We
evaluate creativity through both novelty and validity, using statistical
metrics in the CLIP embedding space. Through extensive experiments, we show
consistent gains in creative novelty with negligible computational overhead.
Moreover, unlike existing methods that primarily generate single objects, our
approach extends to complex scenarios, such as generating coherent sets of
creative objects and preserving creativity within elaborate compositional
prompts. Our method integrates seamlessly into existing diffusion pipelines,
offering a practical route to producing creative outputs that venture beyond
the constraints of textual descriptions.
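
The abstract describes the method only at a high level, so the following is a minimal sketch of how such a loop might be organized, not the authors' implementation. It assumes that at each denoising step a preview of the intermediate result is shown to a VLM, that the concept the VLM names is added to a growing negative prompt, and that the next step is conditioned on that negative prompt. All names here (`adaptive_negative_prompting`, `denoise_step`, `preview`, `vlm_describe`, `query_every`) are hypothetical placeholders supplied by the caller; none of them comes from the paper or from any specific library.

```python
from typing import Any, Callable, List, Tuple


def adaptive_negative_prompting(
    prompt: str,
    num_steps: int,
    denoise_step: Callable[[Any, int, str, str], Any],
    preview: Callable[[Any, int], Any],
    vlm_describe: Callable[[Any], str],
    init_latent: Any,
    query_every: int = 1,
) -> Tuple[Any, List[str]]:
    """Run a denoising loop whose negative prompt grows from VLM feedback.

    Assumed (hypothetical) callables:
      denoise_step(latent, t, prompt, negative_prompt) -> next latent
      preview(latent, t) -> an image-like preview of the current state
      vlm_describe(image) -> the conventional concept the VLM recognizes
    """
    negatives: List[str] = []
    latent = init_latent
    for t in range(num_steps):
        if t % query_every == 0:
            # Ask the VLM which familiar concept the intermediate output resembles.
            concept = vlm_describe(preview(latent, t)).strip().lower()
            # Accumulate each new, non-empty concept exactly once.
            if concept and concept not in negatives:
                negatives.append(concept)
        # Steer the next step away from every concept collected so far.
        latent = denoise_step(latent, t, prompt, ", ".join(negatives))
    return latent, negatives


def toy_demo() -> Tuple[Any, List[str], List[str]]:
    """Exercise the loop with stand-in components (no real model involved)."""
    answers = iter(["cat", "dog", "cat", ""])
    used_negative_prompts: List[str] = []

    def denoise_step(latent: int, t: int, prompt: str, negative_prompt: str) -> int:
        used_negative_prompts.append(negative_prompt)
        return latent + 1  # stand-in for a real denoising update

    def preview(latent: int, t: int) -> int:
        return latent  # stand-in for decoding a preview image

    def vlm_describe(image: int) -> str:
        return next(answers)  # stand-in for a real VLM query

    latent, negatives = adaptive_negative_prompting(
        "a new type of pet", 4, denoise_step, preview, vlm_describe, 0
    )
    return latent, negatives, used_negative_prompts
```

In a real pipeline, `denoise_step` would wrap one sampler step that accepts a negative prompt, and `vlm_describe` would wrap a VLM call; raising `query_every` is one way to trade responsiveness for fewer VLM queries.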