A Systematic Post-Train Framework for Video Generation

Abstract

Large-scale video diffusion models can generate high-resolution, semantically rich content, yet a significant gap remains between their pretraining performance and real-world deployment requirements, owing to prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four complementary stages. First, Supervised Fine-Tuning (SFT) transforms the base model into a stable instruction-following policy. Second, a Reinforcement Learning from Human Feedback (RLHF) stage uses a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence. Third, Prompt Enhancement refines user inputs with a specialized language model. Finally, Inference Optimization addresses system efficiency. Together, these components provide a systematic approach to improving visual quality, temporal coherence, and instruction following while preserving the controllability learned during pretraining. The result is a practical blueprint for building scalable post-training pipelines that are stable, adaptable, and effective in real-world deployment. Extensive experiments demonstrate that this unified pipeline mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.
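
The abstract does not spell out how the video-specific GRPO variant works, so the following is only a minimal sketch of the group-relative idea in the generic GRPO formulation: several samples are drawn for one prompt, each is scored, and each score is normalized against the group's mean and standard deviation. The function name, the `eps` parameter, and the reward values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples from the same prompt.

    Generic GRPO-style normalization (an assumption, not the paper's
    exact method): advantage_i = (r_i - mean(r)) / (std(r) + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four videos sampled from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to provide a baseline.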
