

PISCO: Precise Video Instance Insertion with Sparse Control

February 9, 2026
Authors: Xiangbo Gao, Renjie Li, Xinghao Chen, Yuheng Wu, Suofei Feng, Qing Yin, Zhengzhong Tu
cs.AI

Abstract

The landscape of AI video generation is undergoing a pivotal shift: moving beyond general-purpose generation - which relies on exhaustive prompt engineering and "cherry-picking" - towards fine-grained, controllable generation and high-fidelity post-processing. In professional AI-assisted filmmaking, it is crucial to perform precise, targeted modifications. A cornerstone of this transition is video instance insertion, which requires inserting a specific instance into existing footage while maintaining scene integrity. Unlike traditional video editing, this task must satisfy several requirements: precise spatio-temporal placement, physically consistent scene interaction, and faithful preservation of the original dynamics - all under minimal user effort. In this paper, we propose PISCO, a video diffusion model for precise video instance insertion with arbitrary sparse keyframe control. PISCO allows users to specify a single keyframe, start-and-end keyframes, or sparse keyframes at arbitrary timestamps, and automatically propagates object appearance, motion, and interaction. To address the severe distribution shift induced by sparse conditioning in pretrained video diffusion models, we introduce Variable-Information Guidance for robust conditioning and Distribution-Preserving Temporal Masking to stabilize temporal generation, together with geometry-aware conditioning for realistic scene adaptation. We further construct PISCO-Bench, a benchmark with verified instance annotations and paired clean background videos, and evaluate performance using both reference-based and reference-free perceptual metrics. Experiments demonstrate that PISCO consistently outperforms strong inpainting and video editing baselines under sparse control, and exhibits clear, monotonic performance improvements as additional control signals are provided. Project page: xiangbogaobarry.github.io/PISCO.
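The abstract does not specify how PISCO encodes sparse keyframe constraints internally, but the interface it describes - a single keyframe, start-and-end keyframes, or keyframes at arbitrary timestamps - is commonly realized as a per-frame binary mask paired with masked frame content. The sketch below is an illustrative assumption, not the paper's implementation; the function name, tensor shapes, and masking scheme are all hypothetical:

```python
import numpy as np

def build_sparse_keyframe_condition(video, keyframe_indices):
    """Illustrative sparse-keyframe conditioning (assumed scheme, not PISCO's).

    video: float array of shape (T, H, W, C) holding frame content;
           only the frames listed in keyframe_indices are treated as known.
    keyframe_indices: any subset of {0, ..., T-1} the user pins down -
           one frame, the first and last frames, or arbitrary timestamps.

    Returns (cond, mask): cond zeroes out all unconditioned frames, and the
    per-frame binary mask tells the model which frames are hard constraints.
    """
    T = video.shape[0]
    mask = np.zeros(T, dtype=np.float32)
    mask[list(keyframe_indices)] = 1.0
    # Broadcast the per-frame mask over spatial and channel dimensions.
    cond = video * mask[:, None, None, None]
    return cond, mask

# Example: a 16-frame clip with start-and-end keyframe control.
video = np.random.rand(16, 8, 8, 3).astype(np.float32)
cond, mask = build_sparse_keyframe_condition(video, [0, 15])
```

Under this encoding, denser control (more keyframes) simply flips more mask entries to 1, which is consistent with the abstract's observation that performance improves monotonically as additional control signals are provided.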
PDF · February 14, 2026