Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
December 3, 2025
Authors: Subin Kim, Sangwoo Mo, Mamshad Nayeem Rizve, Yiran Xu, Difan Liu, Jinwoo Shin, Tobias Hinz
cs.AI
Abstract
Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference time. Visualizations are available at https://subin-kim-cv.github.io/PRIS.
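The review-and-redesign loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`recurring_failures`, `redesign_loop`, the `toy_*` stand-ins, the 50% failure threshold, the sample and round counts) is a hypothetical placeholder chosen for illustration. A real system would presumably use a generative model, a vision-language verifier, and a language model for the revision step. The sketch only shows the control flow: sample several visuals, check each prompt element separately, find elements that fail across many samples, and revise the prompt before regenerating.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical types and names for illustration only; not the authors' API.
Checks = Dict[str, bool]  # prompt element -> is it depicted correctly?


def recurring_failures(checks: List[Checks], min_ratio: float = 0.5) -> List[str]:
    """Elements that fail in at least `min_ratio` of the verified samples."""
    if not checks:
        return []
    fails = Counter(elem for c in checks for elem, ok in c.items() if not ok)
    return sorted(e for e, n in fails.items() if n / len(checks) >= min_ratio)


def redesign_loop(
    prompt: str,
    elements: List[str],
    generate: Callable[[str, int], object],
    verify: Callable[[object, List[str]], Checks],
    revise: Callable[[str, List[str]], str],
    n_samples: int = 4,
    n_rounds: int = 3,
) -> str:
    """Generate, verify per element, then revise the prompt on recurring failures."""
    for _ in range(n_rounds):
        visuals = [generate(prompt, seed) for seed in range(n_samples)]
        checks = [verify(v, elements) for v in visuals]
        failures = recurring_failures(checks)
        if not failures:
            break  # no element fails consistently; keep the current prompt
        prompt = revise(prompt, failures)
    return prompt


# Toy stand-ins so the sketch runs without any model (assumptions, not real components).
HARD = {"blue sphere"}  # element the toy generator tends to miss


def toy_generate(prompt: str, seed: int) -> Checks:
    # A toy "visual" is just a record of which elements came out right.
    return {e: e not in HARD or f"emphasize {e}" in prompt or seed == 0
            for e in ["red cube", "blue sphere"]}


def toy_verify(visual: Checks, elements: List[str]) -> Checks:
    # Element-level check: one verdict per prompt element, not one overall score.
    return {e: visual[e] for e in elements}


def toy_revise(prompt: str, failures: List[str]) -> str:
    return prompt + "".join(f", emphasize {e}" for e in failures)


final_prompt = redesign_loop(
    "a red cube on a blue sphere", ["red cube", "blue sphere"],
    toy_generate, toy_verify, toy_revise,
)
```

In this toy run, the "blue sphere" element fails in three of the four first-round samples, so it is the only element added to the revised prompt. The per-element verdicts are what make the revision targeted; a single holistic score would not say which attribute to fix.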