Composing Concepts from Images and Videos via Concept-prompt Binding
December 10, 2025
Authors: Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang, Anyi Rao
cs.AI
Abstract
Visual concept composition aims to integrate different elements from images and videos into a single, coherent visual output, yet existing methods still fall short in accurately extracting complex concepts from visual inputs and in flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts to corresponding prompt tokens and composing the target prompt from bound tokens drawn from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers, encoding visual concepts into their corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
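
To make the binding idea concrete, below is a minimal, hypothetical sketch of concept-prompt binding as described in the abstract, assuming a PyTorch DiT-style backbone that conditions on a sequence of prompt embeddings via cross-attention. All names here (ConceptBinder, absorbent_token, bound_idx, compose_prompt) are illustrative assumptions, not the authors' implementation; the hierarchical binder and the dual-branch temporal variant for video concepts are omitted.

```python
# Minimal sketch of concept-prompt binding (hypothetical; not the authors' code).
# Assumes a DiT backbone that cross-attends to a sequence of prompt embeddings.
import torch
import torch.nn as nn

class ConceptBinder(nn.Module):
    """Binds one visual concept to a designated prompt token.

    Projects a visual concept embedding into the text-embedding space and
    writes it into the prompt-token slot it is bound to. A learnable
    "absorbent" token is appended so concept-irrelevant details have
    somewhere to go during training with diversified prompts (a rough
    analogue of the paper's Diversify-and-Absorb Mechanism).
    """

    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # concept code -> token space
        self.absorbent_token = nn.Parameter(0.02 * torch.randn(1, 1, dim))

    def forward(
        self,
        prompt_emb: torch.Tensor,   # (B, T, D) text-encoder output
        concept_emb: torch.Tensor,  # (B, D) visual concept code
        bound_idx: int,             # index of the prompt token to bind
    ) -> torch.Tensor:
        out = prompt_emb.clone()
        out[:, bound_idx] = self.proj(concept_emb)  # bind concept to its token
        absorb = self.absorbent_token.expand(out.size(0), -1, -1)
        return torch.cat([out, absorb], dim=1)      # (B, T+1, D)

def compose_prompt(prompt_emb, binders, concepts, slots):
    """Compose a target prompt from bound tokens of multiple sources
    (e.g., one image concept and one video concept), one binder each."""
    for binder, concept, idx in zip(binders, concepts, slots):
        prompt_emb = binder(prompt_emb, concept, idx)
    return prompt_emb
```

Under this reading, one-shot training would optimize each binder against a single reference image or video, after which the composed prompt embedding conditions the Diffusion Transformer through its ordinary cross-attention layers.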