Motion-Aware Concept Alignment for Consistent Video Editing

Abstract

We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework that bridges the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video while preserving the original motion and visual context. Our approach leverages a diagonal denoising schedule and class-agnostic segmentation to detect and track objects in the latent space and to precisely control the spatial location of the blended objects. To ensure temporal coherence, we incorporate momentum-based semantic corrections and gamma residual noise stabilization, which yield smooth frame transitions. We evaluate MoCA-Video using standard SSIM, image-level LPIPS, and temporal LPIPS, and we introduce a novel metric, CASS (Conceptual Alignment Shift Score), to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames. On a self-constructed dataset, MoCA-Video outperforms current baselines, achieving superior spatial consistency, coherent motion, and a significantly higher CASS score, despite requiring no training or fine-tuning. MoCA-Video demonstrates that structured manipulation of the diffusion noise trajectory enables controllable, high-quality video synthesis.
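
To make the idea of momentum-based correction concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not define the update rule, so this sketch assumes that the per-frame semantic corrections are smoothed with an exponential moving average and that a small Gaussian residual, weighted by a scalar `gamma`, is then added. The function names and the parameters `beta`, `gamma`, and `seed` are hypothetical.

```python
import numpy as np


def momentum_smooth(corrections, beta=0.9):
    """Exponential moving average over per-frame corrections of shape (T, ...).

    Assumption: this generic EMA stands in for the "momentum-based semantic
    correction" named in the abstract, whose exact form is not given there.
    """
    corrections = np.asarray(corrections, dtype=np.float64)
    smoothed = np.empty_like(corrections)
    # Start the running average at the first frame's correction.
    velocity = corrections[0].copy()
    for t, delta in enumerate(corrections):
        velocity = beta * velocity + (1.0 - beta) * delta
        smoothed[t] = velocity
    return smoothed


def apply_with_residual_noise(latents, corrections, beta=0.9, gamma=0.0, seed=0):
    """Add smoothed corrections plus a gamma-weighted Gaussian residual.

    Assumption: `gamma` is a hypothetical scalar weight on fresh noise; the
    abstract only names "gamma residual noise stabilization".
    """
    rng = np.random.default_rng(seed)
    latents = np.asarray(latents, dtype=np.float64)
    smoothed = momentum_smooth(corrections, beta)
    noise = rng.standard_normal(latents.shape)
    return latents + smoothed + gamma * noise
```

Under these assumptions, a larger `beta` damps frame-to-frame jumps in the correction more strongly, at the cost of responding more slowly to genuine changes in the target object.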
