Composing Concepts from Images and Videos via Concept-prompt Binding

Abstract

Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
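As a rough illustration of the "bind, then compose" idea described in the abstract, the toy sketch below treats binding as adding a fixed offset to a token's embedding and composition as picking, for each token of the target prompt, the embedding from whichever source that token was bound to. This is not the paper's implementation: in the paper the binder is a learned hierarchical module inside the cross-attention conditioning of a Diffusion Transformer, whereas the names `bind`, `compose`, `offsets`, and `owner`, along with all numeric values here, are hypothetical stand-ins chosen only to make the data flow concrete.

```python
from typing import Dict, List

Embedding = List[float]
Table = Dict[str, Embedding]


def bind(base: Table, offsets: Table) -> Table:
    """Return a copy of `base` in which each token listed in `offsets`
    has that offset added to its embedding.

    Assumption: a fixed additive offset stands in for a learned binder.
    """
    bound = {tok: list(vec) for tok, vec in base.items()}
    for tok, delta in offsets.items():
        bound[tok] = [b + d for b, d in zip(base[tok], delta)]
    return bound


def compose(
    prompt: List[str],
    base: Table,
    sources: Dict[str, Table],
    owner: Dict[str, str],
) -> List[Embedding]:
    """Build the target prompt's embedding sequence.

    Tokens listed in `owner` take their embedding from the named source;
    all other tokens fall back to the unbound `base` table.
    """
    composed = []
    for tok in prompt:
        table = sources[owner[tok]] if tok in owner else base
        composed.append(table[tok])
    return composed


# Hypothetical 2-dimensional embeddings for a three-token prompt.
base = {"a": [0.0, 0.0], "dog": [1.0, 0.0], "dancing": [0.0, 1.0]}

# One appearance concept bound from an image, one motion concept from a video.
image_bound = bind(base, {"dog": [0.5, 0.25]})
video_bound = bind(base, {"dancing": [0.25, 0.5]})

target = compose(
    ["a", "dog", "dancing"],
    base,
    {"image": image_bound, "video": video_bound},
    {"dog": "image", "dancing": "video"},
)
```

In this sketch the unbound table is left untouched, so the same base vocabulary can be reused when binding concepts from further sources.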
