A Systematic Post-Training Framework for Video Generation
April 28, 2026
Authors: Zeyue Xue, Siming Fu, Jie Huang, Shuai Lu, Haoran Li, Yijun Liu, Yuming Li, Xiaoxuan He, Mengzhao Chen, Haoyang Huang, Nan Duan, Ping Luo
cs.AI
Abstract
While large-scale video diffusion models have demonstrated impressive capabilities in generating high-resolution and semantically rich content, a significant gap remains between their pretraining performance and real-world deployment requirements due to critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages: we first employ Supervised Fine-Tuning (SFT) to transform the base model into a stable instruction-following policy, followed by a Reinforcement Learning from Human Feedback (RLHF) stage that utilizes a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence; subsequently, we integrate Prompt Enhancement via a specialized language model to refine user inputs, and finally address system efficiency through Inference Optimization. Together, these components provide a systematic approach to improving visual quality, temporal coherence, and instruction following, while preserving the controllability learned during pretraining. The result is a practical blueprint for building scalable post-training pipelines that are stable, adaptable, and effective in real-world deployment. Extensive experiments demonstrate that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.
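The core of the GRPO stage is group-relative credit assignment: several rollouts are sampled for the same prompt, scored by a reward model, and each sample's advantage is computed relative to its own group rather than a learned value baseline. The abstract does not give the video-diffusion-specific details, so the reward values, function name, and group size below are illustrative; this is a minimal sketch of the standard group-relative normalization, not the authors' implementation.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantages in the GRPO style: normalize each
    rollout's reward by the mean and standard deviation of the group
    of rollouts generated for the same prompt (illustrative sketch)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four videos sampled from one prompt, scored by a
# (hypothetical) human-preference reward model.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rewards)
# Advantages are zero-mean within the group: above-average samples
# get positive weight in the policy-gradient update, below-average
# samples get negative weight.
```

Because the baseline is the group mean, no separate value network is needed; this is the property that makes the method attractive for expensive video rollouts, where training an extra critic over frames would add substantial cost.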