CoDeF: Content Deformation Fields for Temporally Consistent Video Processing

Abstract

We present the content deformation field (CoDeF), a new type of video representation. It consists of a canonical content field, which aggregates the static contents of the entire video, and a temporal deformation field, which records the transformations from the canonical image (i.e., the image rendered from the canonical content field) to each individual frame along the time axis. Given a target video, the two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We deliberately introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video. With such a design, CoDeF naturally supports lifting image algorithms to video processing: one can apply an image algorithm to the canonical image and propagate the outcomes to the entire video with the aid of the temporal deformation field. We show experimentally that CoDeF can lift image-to-image translation to video-to-video translation and keypoint detection to keypoint tracking without any training. More importantly, because our lifting strategy deploys the algorithms on only one image, we achieve superior cross-frame consistency in processed videos compared to existing video-to-video translation approaches, and even manage to track non-rigid objects like water and smog. The project page is available at https://qiuyu96.github.io/CoDeF/.
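The lifting idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract describes both fields as jointly optimized representations, whereas the sketch stands in dense arrays for them. The function names (`propagate`, `identity_field`, `lift`), the array layout, the backward-warp convention (each frame pixel stores the canonical coordinate it samples from), and the nearest-neighbour sampling are all assumptions made here for illustration.

```python
import numpy as np


def propagate(canonical, deformation):
    """Warp an (H, W, C) canonical image into T frames.

    Assumption: `deformation` has shape (T, H, W, 2) and stores, for every
    frame pixel, the (x, y) coordinate in the canonical image to sample from.
    """
    h, w = canonical.shape[:2]
    # Nearest-neighbour lookup, clamped to the image bounds.
    xs = np.clip(np.rint(deformation[..., 0]).astype(int), 0, w - 1)
    ys = np.clip(np.rint(deformation[..., 1]).astype(int), 0, h - 1)
    return canonical[ys, xs]  # shape (T, H, W, C)


def identity_field(t, h, w):
    """Hypothetical helper: a deformation field that leaves every frame unchanged."""
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs, ys], axis=-1).astype(float)
    return np.broadcast_to(grid, (t, h, w, 2)).copy()


def lift(image_algorithm, canonical, deformation):
    """Run an image algorithm once on the canonical image, then propagate it."""
    return propagate(image_algorithm(canonical), deformation)


# Usage: invert the canonical image once and obtain an inverted two-frame video.
canonical = np.arange(4 * 5 * 3, dtype=float).reshape(4, 5, 3)
field = identity_field(2, 4, 5)
video = lift(lambda image: 255.0 - image, canonical, field)
```

Because the image algorithm runs only once, every frame is derived from the same edited canonical image, which is the source of the cross-frame consistency the abstract claims.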
