
A Systematic Post-Train Framework for Video Generation

April 28, 2026
Authors: Zeyue Xue, Siming Fu, Jie Huang, Shuai Lu, Haoran Li, Yijun Liu, Yuming Li, Xiaoxuan He, Mengzhao Chen, Haoyang Huang, Nan Duan, Ping Luo
cs.AI

Abstract

While large-scale video diffusion models have demonstrated impressive capabilities in generating high-resolution and semantically rich content, a significant gap remains between their pretraining performance and real-world deployment requirements due to critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages: we first employ Supervised Fine-Tuning (SFT) to transform the base model into a stable instruction-following policy, followed by a Reinforcement Learning from Human Feedback (RLHF) stage that utilizes a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence; subsequently, we integrate Prompt Enhancement via a specialized language model to refine user inputs, and finally address system efficiency through Inference Optimization. Together, these components provide a systematic approach to improving visual quality, temporal coherence, and instruction following, while preserving the controllability learned during pretraining. The result is a practical blueprint for building scalable post-training pipelines that are stable, adaptable, and effective in real-world deployment. Extensive experiments demonstrate that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.
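The abstract's RLHF stage centers on Group Relative Policy Optimization (GRPO), whose defining trait is scoring each sample against a group of samples generated for the same prompt instead of training a separate value critic. The sketch below illustrates that group-relative advantage computation in generic form; it is an assumption-laden illustration of the standard GRPO idea, not the paper's video-diffusion-specific variant, and the function name and reward values are hypothetical.

```python
# Illustrative sketch of the group-relative advantage used in GRPO-style
# RLHF. NOT the paper's exact method: the video-diffusion tailoring
# described in the abstract is not reproduced here.
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    In GRPO-style training, several samples (here: candidate videos) are
    generated for the same prompt; each sample's advantage is its reward
    measured relative to the group, which removes the need for a learned
    value critic.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four videos sampled for one prompt, scored by a reward model.
adv = group_relative_advantages([0.2, 0.5, 0.8, 0.5])
```

These advantages would then weight a clipped policy-gradient update on the diffusion policy; samples above the group mean are reinforced and those below are suppressed.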