Zero-shot Image Editing with Reference Imitation

Abstract

Image editing is a practical yet challenging task given the diverse demands of users, and one of the hardest parts is describing precisely what the edited image should look like. In this work, we present a new form of editing, termed imitative editing, to help users exercise their creativity more conveniently. Concretely, to edit an image region of interest, users are free to draw inspiration directly from in-the-wild references (e.g., related pictures found online), without having to worry about how well the reference fits the source. Such a design requires the system to figure out automatically what to take from the reference in order to perform the edit. For this purpose, we propose a generative training framework, dubbed MimicBrush, which randomly selects two frames from a video clip, masks some regions of one frame, and learns to recover the masked regions using the information from the other frame. In this way, our model, developed from a diffusion prior, is able to capture the semantic correspondence between separate images in a self-supervised manner. We experimentally show the effectiveness of our method under various test cases as well as its superiority over existing alternatives. We also construct a benchmark to facilitate further research.
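The training-sample construction described in the abstract (two random frames from one clip, one of them partially masked) can be illustrated with a minimal sketch. This is not the MimicBrush implementation: the function name `make_training_pair`, the `mask_frac` parameter, the single-rectangle masking rule, and the use of `None` to mark hidden pixels are all assumptions made for illustration, since the abstract does not specify how regions are masked.

```python
import random


def make_training_pair(frames, rng, mask_frac=0.25):
    """Build one self-supervised sample from a video clip.

    frames: list of T >= 2 frames, each an H x W list of pixel values.
    Returns (masked_source, mask, reference, target).
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")

    # Pick two distinct frames: one to restore (source), one to imitate (reference).
    src_idx, ref_idx = rng.sample(range(len(frames)), 2)
    target, reference = frames[src_idx], frames[ref_idx]
    height, width = len(target), len(target[0])

    # Assumed masking rule (hypothetical): one random axis-aligned rectangle
    # covering roughly mask_frac of the frame area.
    box_h = max(1, int(height * mask_frac ** 0.5))
    box_w = max(1, int(width * mask_frac ** 0.5))
    top = rng.randint(0, height - box_h)    # randint is inclusive on both ends
    left = rng.randint(0, width - box_w)
    mask = [[top <= y < top + box_h and left <= x < left + box_w
             for x in range(width)] for y in range(height)]

    # Hide the masked pixels; None marks values the model must recover.
    masked_source = [[None if mask[y][x] else target[y][x]
                      for x in range(width)] for y in range(height)]
    return masked_source, mask, reference, target
```

Under these assumptions, a model would receive `masked_source`, `mask`, and `reference` as input and be trained to reconstruct `target` inside the masked region, which is only possible if it learns where the corresponding content sits in the reference frame.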
