
Layer-Aware Video Composition via Split-then-Merge

November 25, 2025
Authors: Ozgur Kara, Yujia Chen, Ming-Hsuan Yang, James M. Rehg, Wen-Sheng Chu, Du Tran
cs.AI

Abstract

We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and to address its data-scarcity problem. Unlike conventional methods that rely on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a transformation-aware training pipeline that uses multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show that StM outperforms state-of-the-art methods in both quantitative benchmarks and human/VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io
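The abstract describes the method only at a high level. As a concrete illustration, below is a minimal, hypothetical PyTorch sketch of the two ideas it names: alpha-blending a split-out foreground layer onto a background layer ("self-composition"), and a masked reconstruction term standing in for the identity-preservation loss. The function names, tensor layouts, and the masked-L1 form are our assumptions for illustration, not the paper's published implementation.

```python
# Hypothetical sketch of StM's self-composition and identity-preservation
# ideas; all names, shapes, and the loss form are assumptions, not the
# authors' code.
import torch


def self_compose(fg_rgba: torch.Tensor, bg: torch.Tensor) -> torch.Tensor:
    """Alpha-blend a segmented foreground layer onto a background layer.

    fg_rgba: (T, 4, H, W) foreground frames with an alpha matte.
    bg:      (T, 3, H, W) background frames.
    Returns the composited (T, 3, H, W) video clip.
    """
    rgb, alpha = fg_rgba[:, :3], fg_rgba[:, 3:4]
    return alpha * rgb + (1.0 - alpha) * bg


def identity_preservation_loss(pred: torch.Tensor,
                               fg_rgb: torch.Tensor,
                               alpha: torch.Tensor) -> torch.Tensor:
    """Penalize deviation from the original foreground inside its matte,
    so blending does not distort subject identity. A masked-L1 guess at
    the loss the abstract describes.

    pred, fg_rgb: (T, 3, H, W); alpha: (T, 1, H, W) in [0, 1].
    """
    mask = (alpha > 0.5).float()
    return (mask * (pred - fg_rgb).abs()).sum() / mask.sum().clamp(min=1.0)
```

In this reading, training pairs come "for free": any unlabeled video yields a foreground layer, a background layer, and a composite to reconstruct, which is what lets StM sidestep the annotated-data requirement.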