

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

October 23, 2025
Authors: Bingjie Gao, Qianli Ma, Xiaoxue Wu, Shuai Yang, Guanzhou Lan, Haonan Zhao, Jiaxuan Chen, Qingyang Liu, Yu Qiao, Xinyuan Chen, Yaohui Wang, Li Niu
cs.AI

Abstract

Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present RAPO++, a cross-stage prompt optimization framework that unifies training-data-aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In Stage 1, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback, including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow, yielding progressively improved video generation quality. Stage 3 leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.
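To make the three-stage pipeline concrete, the following is a minimal, hypothetical sketch of the cross-stage flow the abstract describes: Stage 1 enriches a prompt via retrieval from a relation graph, Stage 2 runs a closed-loop refinement driven by a feedback score, and the resulting prompt pairs feed Stage 3 fine-tuning. All function names, the graph structure, and the scoring logic are illustrative placeholders, not the authors' actual implementation (see the linked repository for that).

```python
def retrieve_modifiers(prompt, relation_graph):
    """Stage 1 (RAPO, sketch): collect modifiers whose key concept appears in the prompt."""
    return [m for key, mods in relation_graph.items() if key in prompt for m in mods]

def refine_prompt(prompt, relation_graph):
    """Stage 1 (sketch): enrich the user prompt toward training-data style."""
    modifiers = retrieve_modifiers(prompt, relation_graph)
    return prompt + (", " + ", ".join(modifiers) if modifiers else "")

def sspo_loop(prompt, generate, score, rewrite, n_iters=3):
    """Stage 2 (SSPO, sketch): closed-loop test-time refinement.

    `generate` renders a video from a prompt, `score` aggregates multi-source
    feedback (semantic alignment, temporal coherence, optical flow, ...), and
    `rewrite` is the LLM rewriter. The (before, after, score) history can serve
    as Stage 3 fine-tuning pairs for the rewriter.
    """
    best_prompt, best_score = prompt, score(generate(prompt))
    history = []
    for _ in range(n_iters):
        candidate = rewrite(best_prompt)   # rewriter proposes a refined variant
        s = score(generate(candidate))     # feedback signal on the generated video
        history.append((best_prompt, candidate, s))
        if s > best_score:                 # keep the candidate only if it improves
            best_prompt, best_score = candidate, s
    return best_prompt, history
```

The key design point the abstract emphasizes is that both loops operate purely on prompts, so the generative backbone (`generate` here) is treated as a frozen black box.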