

Layer-Aware Video Composition via Split-then-Merge

November 25, 2025
Authors: Ozgur Kara, Yujia Chen, Ming-Hsuan Yang, James M. Rehg, Wen-Sheng Chu, Du Tran
cs.AI

Abstract

We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods that rely on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show that StM outperforms SoTA methods in both quantitative benchmarks and human- and VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io
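The split-then-merge self-composition idea from the abstract can be illustrated with a minimal sketch: split a video into foreground/background layers with a mask, composite the foreground onto a different background, and penalize deviation of the foreground region (an identity-preservation term). This is only an assumed toy formulation, not the paper's implementation; the binary masks, the L2 form of the loss, and all function names here are illustrative.

```python
import numpy as np

def split(video, mask):
    """Split a video (T, H, W, C) into foreground and background layers
    using an alpha mask (T, H, W, 1) with values in [0, 1]."""
    fg = video * mask
    bg = video * (1.0 - mask)
    return fg, bg

def merge(fg, mask, background):
    """Alpha-composite a masked foreground layer onto a new background
    (the 'self-composition' step: subject from one clip, scene from another)."""
    return fg + background * (1.0 - mask)

def identity_preservation_loss(generated, fg, mask):
    """Hypothetical L2 identity term: penalize deviation of the generated
    video from the original foreground inside the masked region."""
    diff = (generated - fg) * mask
    return float(np.mean(diff ** 2))

# Toy example with two random "videos" and a random binary mask.
T, H, W, C = 2, 4, 4, 3
rng = np.random.default_rng(0)
video_a = rng.random((T, H, W, C))   # provides the foreground subject
video_b = rng.random((T, H, W, C))   # provides the new background scene
mask = (rng.random((T, H, W, 1)) > 0.5).astype(np.float64)

fg, _ = split(video_a, mask)
composed = merge(fg, mask, video_b)

# With a binary mask, compositing leaves the foreground untouched,
# so the identity term is exactly 0.0 here.
loss = identity_preservation_loss(composed, fg, mask)
```

In the actual method, `composed` would instead be the output of the generative model trained on such self-composed pairs, and the identity term would keep the subject faithful while the model learns affordance-aware placement.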