

Layer-Aware Video Composition via Split-then-Merge

November 25, 2025
Authors: Ozgur Kara, Yujia Chen, Ming-Hsuan Yang, James M. Rehg, Wen-Sheng Chu, Du Tran
cs.AI

Abstract

We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods that rely on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show that StM outperforms SoTA methods in both quantitative benchmarks and human- and VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io
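The split-then-merge self-composition idea from the abstract can be illustrated with a minimal sketch: split a video into foreground/background layers with a mask, composite the foreground onto a different background, and penalize deviation of the foreground region (an identity-preservation term). This is only an assumed toy formulation, not the paper's implementation; the binary masks, the L2 form of the loss, and all function names here are illustrative.

```python
import numpy as np

def split(video, mask):
    """Split a video (T, H, W, C) into foreground and background layers
    using an alpha mask (T, H, W, 1) with values in [0, 1]."""
    fg = video * mask
    bg = video * (1.0 - mask)
    return fg, bg

def merge(fg, mask, background):
    """Alpha-composite a masked foreground layer onto a new background
    (the 'self-composition' step: subject from one clip, scene from another)."""
    return fg + background * (1.0 - mask)

def identity_preservation_loss(generated, fg, mask):
    """Hypothetical L2 identity term: penalize deviation of the generated
    video from the original foreground inside the masked region."""
    diff = (generated - fg) * mask
    return float(np.mean(diff ** 2))

# Toy example with two random "videos" and a random binary mask.
T, H, W, C = 2, 4, 4, 3
rng = np.random.default_rng(0)
video_a = rng.random((T, H, W, C))   # provides the foreground subject
video_b = rng.random((T, H, W, C))   # provides the new background scene
mask = (rng.random((T, H, W, 1)) > 0.5).astype(np.float64)

fg, _ = split(video_a, mask)
composed = merge(fg, mask, video_b)

# With a binary mask, compositing leaves the foreground untouched,
# so the identity term is exactly 0.0 here.
loss = identity_preservation_loss(composed, fg, mask)
```

In the actual method, `composed` would instead be the output of the generative model trained on such self-composed pairs, and the identity term would keep the subject faithful while the model learns affordance-aware placement.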