

DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

December 24, 2025
Authors: Jiawei Liu, Junqiao Li, Jiangfan Deng, Gen Li, Siyu Zhou, Zetao Fang, Shanshan Lao, Zengde Deng, Jianing Zhu, Tingting Ma, Jiayi Li, Yunqiu Wang, Qian He, Xinglong Wu
cs.AI

Abstract

The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.
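The Segment-wise Auto-Regressive (SAR) inference idea described above can be sketched as a loop that produces a long video in bounded memory: each segment is generated conditioned on a short tail of the previous segment plus any user-provided guide frames that fall inside it. The sketch below is a minimal, hypothetical illustration under assumed details; the stub generator, segment length, overlap size, and frame representation are not taken from the paper.

```python
def generate_segment(context_frames, guide_frames, length):
    """Stub standing in for the video model.

    A real DiT-based generator would denoise latents conditioned on
    `context_frames` (overlap carried from the previous segment) and
    `guide_frames` (a dict of local frame index -> user-provided frame).
    """
    frames = [f"generated(ctx={len(context_frames)})" for _ in range(length)]
    for idx, frame in guide_frames.items():
        frames[idx] = frame  # pin user-provided guide frames at their positions
    return frames

def sar_inference(total_frames, segment_len, overlap, guide_frames):
    """Generate `total_frames` frames segment by segment.

    `guide_frames` maps global frame indices to user-provided frames.
    Memory stays bounded because only the last `overlap` frames of each
    segment are carried forward as conditioning context.
    """
    video, context, start = [], [], 0
    while start < total_frames:
        length = min(segment_len, total_frames - start)
        # Select guide frames landing in this segment, re-indexed locally.
        local_guides = {g - start: f for g, f in guide_frames.items()
                        if start <= g < start + length}
        segment = generate_segment(context, local_guides, length)
        video.extend(segment)
        context = segment[-overlap:]  # short tail is the only carried state
        start += length
    return video

video = sar_inference(total_frames=20, segment_len=8, overlap=2,
                      guide_frames={0: "user_frame_A", 19: "user_frame_B"})
```

Here the first and last frames are pinned to user inputs, and every 8-frame segment sees only a 2-frame context, which is what keeps peak memory independent of the total video length.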