
DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

December 24, 2025
作者: Jiawei Liu, Junqiao Li, Jiangfan Deng, Gen Li, Siyu Zhou, Zetao Fang, Shanshan Lao, Zengde Deng, Jianing Zhu, Tingting Ma, Jiayi Li, Yunqiu Wang, Qian He, Xinglong Wu
cs.AI

Abstract

The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.
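The Segment-wise Auto-Regressive (SAR) strategy described in point (iii) can be illustrated with a minimal sketch. This is a hypothetical reconstruction of the data flow only, not the paper's implementation: `denoise_segment` stands in for a conditioned DiT denoising pass (here it simply clamps anchor slots and fills the rest with noise), and all function names, segment lengths, and the one-frame overlap are illustrative assumptions. The point it shows is how each segment is conditioned on the tail of the previous segment plus the next user-provided guide frame, so that arbitrarily long videos can be produced while holding only one segment in memory at a time.

```python
import numpy as np

def denoise_segment(cond_frames, cond_mask, seg_len, dim, rng):
    """Stand-in for one conditioned DiT denoising pass (hypothetical).
    A real model would iteratively denoise latents under the frame
    conditions; here we just clamp the anchor slots to show data flow."""
    seg = rng.standard_normal((seg_len, dim)) * 0.01  # placeholder "generated" frames
    for i, is_anchor in enumerate(cond_mask):
        if is_anchor:
            seg[i] = cond_frames[i]  # conditioning frames are kept verbatim
    return seg

def sar_generate(anchor_frames, seg_len=8, overlap=1, dim=4, seed=0):
    """Segment-wise auto-regressive generation: each segment is conditioned
    on the last frame of the previous segment (for continuity) and on the
    next user anchor, so peak memory is bounded by a single segment."""
    rng = np.random.default_rng(seed)
    video = []
    prev_tail = anchor_frames[0]
    for anchor in anchor_frames[1:]:
        cond = np.zeros((seg_len, dim))
        mask = np.zeros(seg_len, dtype=bool)
        cond[0], mask[0] = prev_tail, True    # continuity from previous segment
        cond[-1], mask[-1] = anchor, True     # next user-provided guide frame
        seg = denoise_segment(cond, mask, seg_len, dim, rng)
        # drop the overlapped leading frame(s) after the first segment
        video.extend(seg if not video else seg[overlap:])
        prev_tail = seg[-1]
    return np.stack(video)
```

With three guide frames and `seg_len=8`, this yields a 15-frame video whose first, middle-boundary, and last frames coincide with the user anchors, mirroring the abstract's goal of stitching fragmented visual materials into one continuous shot.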