
Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation

December 3, 2025
Authors: Subin Kim, Sangwoo Mo, Mamshad Nayeem Rizve, Yiran Xu, Difan Liu, Jinwoo Shin, Tobias Hinz
cs.AI

Abstract

Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference time. Visualizations are available at the website: https://subin-kim-cv.github.io/PRIS.
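The generate, verify, redesign, regenerate cycle described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how the generator, the element-level verifier, or the prompt revision are realized, so `pris_loop`, its parameters, and the `toy_*` helpers are all hypothetical names, and the 50% threshold for calling a failure "recurring" is an assumption.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Sequence, Tuple


def pris_loop(
    prompt: str,
    elements: Sequence[str],
    generate: Callable[[str, int], Any],                      # hypothetical: (prompt, seed) -> visual
    verify: Callable[[Any, Sequence[str]], Dict[str, bool]],  # hypothetical: element-level checks
    revise: Callable[[str, List[str]], str],                  # hypothetical: prompt redesign step
    rounds: int = 2,
    samples_per_round: int = 4,
    recurring_fraction: float = 0.5,                          # assumed threshold, not from the paper
) -> Tuple[str, Any, float]:
    """Return (best_prompt, best_visual, best_score) after iterative prompt redesign."""
    if not elements:
        raise ValueError("elements must be non-empty")
    best_prompt, best_visual, best_score = prompt, None, -1.0
    for round_idx in range(rounds):
        failure_counts: Counter = Counter()
        for i in range(samples_per_round):
            # Scale the visual side: several samples (seeds) for the current prompt.
            visual = generate(prompt, round_idx * samples_per_round + i)
            checks = verify(visual, elements)
            score = sum(checks[e] for e in elements) / len(elements)
            if score > best_score:
                best_prompt, best_visual, best_score = prompt, visual, score
            failure_counts.update(e for e in elements if not checks[e])
        # Keep only elements that fail across many samples, not one-off misses.
        recurring = [
            e for e in elements
            if failure_counts[e] >= recurring_fraction * samples_per_round
        ]
        if not recurring:
            break
        # Scale the prompt side: redesign the prompt around the recurring failures.
        prompt = revise(prompt, recurring)
    return best_prompt, best_visual, best_score


# Toy stand-ins so the sketch runs end to end; a real system would call a
# generative model, a vision-language verifier, and a prompt rewriter here.
def toy_generate(prompt: str, seed: int) -> set:
    words = prompt.lower().split()
    # Toy "model": renders the first word, and any other word only if repeated.
    return {w for w in words if w == words[0] or words.count(w) > 1}


def toy_verify(visual: set, elements: Sequence[str]) -> Dict[str, bool]:
    return {e: e in visual for e in elements}


def toy_revise(prompt: str, failed: List[str]) -> str:
    # Toy redesign: repeat the failed elements to emphasize them.
    return prompt + " " + " ".join(failed)


if __name__ == "__main__":
    print(pris_loop("cat hat", ["cat", "hat"], toy_generate, toy_verify, toy_revise))
```

In the toy run, every first-round sample misses "hat", so that element is flagged as a recurring failure and the prompt is revised before the second round. With a fixed prompt, drawing more seeds in this toy setup would never recover the missing element, which mirrors the plateau the abstract attributes to scaling visuals alone.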