InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
November 18, 2025
Authors: Daniel Gilo, Or Litany
cs.AI
Abstract
We address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify the scene according to a textual instruction while preserving consistency across all views. Existing methods, based on per-scene neural fields or temporal attention mechanisms, struggle in this setting, often producing artifacts and incoherent edits. We propose InstructMix2Mix (I-Mix2Mix), a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging its data-driven 3D prior for cross-view consistency. A key contribution is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations: incremental student updates across timesteps, a specialized teacher noise scheduler to prevent degeneration, and an attention modification that enhances cross-view coherence without additional cost. Experiments demonstrate that I-Mix2Mix significantly improves multi-view consistency while maintaining high per-frame edit quality.
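To make the training recipe concrete, below is a minimal, self-contained PyTorch sketch of the SDS-style loop the abstract describes, with a multi-view diffusion student standing in for the usual neural-field consolidator. Everything in it is an illustrative assumption rather than the authors' released code: Teacher2DEditor and MultiViewStudent are toy stand-ins for a pretrained 2D instruction editor (e.g. an InstructPix2Pix-style model) and a pretrained multi-view diffusion model, the linear noise schedule is a placeholder for the paper's specialized teacher scheduler, and the proposed attention modification is only hinted at by the joint cross-view attention inside the student.

```python
# Hypothetical sketch of I-Mix2Mix-style distillation. All module and
# function names are illustrative assumptions, not the authors' API.
import torch
import torch.nn as nn

class Teacher2DEditor(nn.Module):
    """Toy stand-in for a frozen 2D instruction-following editor;
    predicts an edit/noise direction for each view independently."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x_t, t, instruction):
        # A real teacher would condition on the timestep t and the text
        # instruction; both are ignored in this placeholder.
        return self.net(x_t)

class MultiViewStudent(nn.Module):
    """Toy stand-in for a pretrained multi-view diffusion model: each
    view attends to all others, which is where a data-driven 3D prior
    and the paper's attention modification would act."""
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, x_t, t):
        h, _ = self.attn(x_t, x_t, x_t)  # joint denoising across views
        return self.out(h)

def i_mix2mix_sketch(views, instruction, teacher, student,
                     n_steps=10, inner_steps=4, lr=1e-4):
    """Coarse-to-fine loop: at each timestep the frozen 2D teacher
    supplies per-view edit targets, and the student is updated
    incrementally, playing the consolidator role that a per-scene
    neural field plays in standard SDS. Schedule and update rule are
    simplifying assumptions."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    x = views.clone()
    for t in torch.linspace(0.9, 0.1, n_steps):          # assumed schedule
        noise = torch.randn_like(x)
        x_t = (1.0 - t) * x + t * noise                  # simplistic forward process
        with torch.no_grad():
            eps_teacher = teacher(x_t, t, instruction)   # per-view edit direction
        for _ in range(inner_steps):                     # incremental student update
            eps_student = student(x_t, t)
            loss = ((eps_student - eps_teacher) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            x = x_t - t * student(x_t, t)                # crude denoising step
    return x

views = torch.randn(1, 6, 64)   # 6 sparse views as toy 64-d feature vectors
edited = i_mix2mix_sketch(views, "make it look like autumn",
                          Teacher2DEditor(), MultiViewStudent())
print(edited.shape)  # torch.Size([1, 6, 64])
```

The point of the sketch is the control flow rather than the architecture: the teacher stays frozen while the student is personalized a little at every timestep, so cross-view consistency comes from the student's joint attention over all views instead of from optimizing a per-scene neural field.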