
A Systematic Post-Train Framework for Video Generation

April 28, 2026
Authors: Zeyue Xue, Siming Fu, Jie Huang, Shuai Lu, Haoran Li, Yijun Liu, Yuming Li, Xiaoxuan He, Mengzhao Chen, Haoyang Huang, Nan Duan, Ping Luo
cs.AI

Abstract

While large-scale video diffusion models have demonstrated impressive capabilities in generating high-resolution and semantically rich content, a significant gap remains between their pretraining performance and real-world deployment requirements due to critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages: we first employ Supervised Fine-Tuning (SFT) to transform the base model into a stable instruction-following policy, followed by a Reinforcement Learning from Human Feedback (RLHF) stage that utilizes a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence; subsequently, we integrate Prompt Enhancement via a specialized language model to refine user inputs, and finally address system efficiency through Inference Optimization. Together, these components provide a systematic approach to improving visual quality, temporal coherence, and instruction following, while preserving the controllability learned during pretraining. The result is a practical blueprint for building scalable post-training pipelines that are stable, adaptable, and effective in real-world deployment. Extensive experiments demonstrate that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.
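The abstract's RLHF stage centers on Group Relative Policy Optimization (GRPO), whose defining trait is scoring each sample against a group of samples generated for the same prompt instead of training a separate value critic. The sketch below illustrates that group-relative advantage computation in generic form; it is an assumption-laden illustration of the standard GRPO idea, not the paper's video-diffusion-specific variant, and the function name and reward values are hypothetical.

```python
# Illustrative sketch of the group-relative advantage used in GRPO-style
# RLHF. NOT the paper's exact method: the video-diffusion tailoring
# described in the abstract is not reproduced here.
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    In GRPO-style training, several samples (here: candidate videos) are
    generated for the same prompt; each sample's advantage is its reward
    measured relative to the group, which removes the need for a learned
    value critic.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four videos sampled for one prompt, scored by a reward model.
adv = group_relative_advantages([0.2, 0.5, 0.8, 0.5])
```

These advantages would then weight a clipped policy-gradient update on the diffusion policy; samples above the group mean are reinforced and those below are suppressed.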