RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
October 23, 2025
作者: Bingjie Gao, Qianli Ma, Xiaoxue Wu, Shuai Yang, Guanzhou Lan, Haonan Zhao, Jiaxuan Chen, Qingyang Liu, Yu Qiao, Xinyuan Chen, Yaohui Wang, Li Niu
cs.AI
Abstract
Prompt design plays a crucial role in text-to-video (T2V) generation, yet
user-provided prompts are often short, unstructured, and misaligned with
training data, limiting the generative potential of diffusion-based T2V models.
We present RAPO++, a cross-stage prompt optimization framework that
unifies training-data-aligned refinement, test-time iterative scaling, and
large language model (LLM) fine-tuning to substantially improve T2V generation
without modifying the underlying generative backbone. In Stage 1,
Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with
semantically relevant modifiers retrieved from a relation graph and refactors
them to match training distributions, enhancing compositionality and
multi-object fidelity. Stage 2 introduces Sample-Specific Prompt
Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts
using multi-source feedback -- including semantic alignment, spatial fidelity,
temporal coherence, and task-specific signals such as optical flow -- yielding
progressively improved video generation quality. Stage 3 leverages
optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing
task-specific optimization patterns and enabling efficient, high-quality prompt
generation even before inference. Extensive experiments across five
state-of-the-art T2V models and five benchmarks demonstrate that RAPO++
achieves significant gains in semantic alignment, compositional reasoning,
temporal stability, and physical plausibility, outperforming existing methods
by large margins. Our results highlight RAPO++ as a model-agnostic,
cost-efficient, and scalable solution that sets a new standard for prompt
optimization in T2V generation. The code is available at
https://github.com/Vchitect/RAPO.
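The Stage-2 closed loop (SSPO) can be pictured as a generate-score-rewrite cycle that keeps the best-scoring prompt. The sketch below is a toy illustration of that control flow only: `generate_video`, `score_video`, and `rewrite_prompt` are hypothetical stand-ins for the frozen T2V backbone, the multi-source feedback evaluators, and the rewriter LLM; the scoring scheme and rewrite rule are invented for the example and are not the paper's implementation.

```python
def score_video(video):
    """Toy multi-source feedback: average semantic, spatial, and temporal
    scores into one scalar. The real SSPO uses learned evaluators plus
    task-specific signals such as optical flow."""
    return (video["semantic"] + video["spatial"] + video["temporal"]) / 3.0

def rewrite_prompt(prompt, feedback):
    """Stand-in for the rewriter LLM: refine the prompt along whichever
    feedback dimension scored lowest."""
    weakest = min(feedback, key=feedback.get)
    return f"{prompt}, improved {weakest}"

def generate_video(prompt):
    """Stand-in for the frozen T2V backbone. Purely for illustration,
    scores rise with each appended refinement."""
    n_refinements = prompt.count("improved")
    base = 0.5 + 0.1 * n_refinements
    return {"semantic": base, "spatial": base, "temporal": base}

def sspo_loop(prompt, rounds=3):
    """Iterate generate -> score -> rewrite, returning the best prompt
    seen and its score (the optimized prompts would later serve as
    fine-tuning data for the Stage-3 rewriter)."""
    best_prompt, best_score = prompt, float("-inf")
    for _ in range(rounds):
        video = generate_video(prompt)
        score = score_video(video)
        if score > best_score:
            best_prompt, best_score = prompt, score
        prompt = rewrite_prompt(prompt, video)
    return best_prompt, best_score

prompt, score = sspo_loop("a cat skateboarding in a park", rounds=3)
print(prompt, round(score, 2))
```

Because the backbone stays frozen, the loop only ever edits the prompt; this is what makes the approach model-agnostic in the sense the abstract describes.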