Composing Concepts from Images and Videos via Concept-prompt Binding

Abstract

Visual concept composition aims to integrate different elements from images and videos into a single, coherent visual output, but it still falls short in accurately extracting complex concepts from visual inputs and in flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts to corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers, encoding visual concepts into their corresponding prompt tokens so that complex visual concepts can be decomposed accurately. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
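
As a rough illustration of the "bind, then compose" idea described in the abstract, the toy sketch below binds concepts to individual prompt tokens and then assembles a target prompt from tokens bound on different sources. Everything in it is an assumption made for illustration: the names `embed`, `bind`, `compose`, and `BoundToken` are hypothetical, the text encoder is a deterministic stand-in, and the hand-written additive offset replaces the learned hierarchical binder, whose details the abstract does not give.

```python
from dataclasses import dataclass

Vector = list[float]


@dataclass
class BoundToken:
    """A prompt token whose embedding carries a visual concept (hypothetical type)."""
    word: str
    source: str        # label of the image or video the concept came from
    embedding: Vector


def embed(word: str, dim: int = 4) -> Vector:
    """Toy stand-in for a text encoder: a deterministic pseudo-embedding."""
    seed = sum(map(ord, word))
    return [((seed * (i + 1)) % 17) / 17.0 for i in range(dim)]


def bind(word: str, source: str, offset: Vector) -> BoundToken:
    """Bind a concept to a token by shifting its embedding.

    Assumption: the offset is given by hand here; in a real system it would
    be produced by a trained binder module.
    """
    base = embed(word, len(offset))
    return BoundToken(word, source, [b + o for b, o in zip(base, offset)])


def compose(prompt: str, bound: dict[str, BoundToken]) -> list[Vector]:
    """Build the conditioning sequence for a target prompt.

    Words with a bound token use the bound embedding; all other words
    fall back to the plain text embedding.
    """
    sequence = []
    for word in prompt.split():
        token = bound.get(word)
        sequence.append(token.embedding if token else embed(word))
    return sequence


# Usage: appearance bound from an image, motion bound from a video.
dog = bind("dog", "image_a", [0.5, 0.0, 0.0, 0.0])
running = bind("running", "video_b", [0.0, 0.25, 0.0, 0.0])
conditioning = compose("a dog running outdoors", {"dog": dog, "running": running})
```

The point of the sketch is only the data flow: each concept lives on its own token, so tokens bound on different sources can be mixed freely in one target prompt while unbound words keep their ordinary embeddings.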
