Text-Vision Co-Instructed Image Editing

Abstract

Existing image editing methods fall broadly into two categories: those driven by textual instructions and those driven by visual prompts. Textual instructions are semantically expressive but offer only coarse spatial control over the editing result. Visual prompts such as drag and point, in contrast, provide precise spatial guidance but leave the semantic intent inherently ambiguous. To combine the strengths of both, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance, aiming at precise and intent-faithful image manipulation. To this end, we first construct a paired textual-visual instruction dataset of more than 23K samples derived from dynamic videos, which provides aligned supervision for cross-modal instructions. We then propose TV-Edit, a unified Textual-Visual instruction Editing framework that contextualizes drag- or point-based visual instructions with image-text semantics and lifts them into semantics-aware control representations for pretrained editing backbones. By integrating semantic intent with spatial constraints, TV-Edit achieves more precise spatial control, less instruction ambiguity, and stronger structural consistency than text-only or drag-only alternatives. Finally, we establish TV-Edit-Bench, a carefully designed benchmark that evaluates semantic faithfulness, spatial alignment, and visual consistency against ground-truth references, using controlled textual-visual variations for reliable assessment. Experiments across multiple editing backbones show that TV-Edit consistently yields more precise and intent-faithful edits, significantly outperforming state-of-the-art instruction-based and drag-based baselines.
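To make the notion of a text-vision co-instruction concrete, the following is a minimal Python sketch of how one such input might be represented and how spatial alignment might be scored. It is an illustration under stated assumptions, not the TV-Edit implementation: the names `CoInstruction`, `validate`, and `mean_target_distance` are hypothetical, the example instruction text is invented, and the distance measure is only one plausible proxy for spatial alignment, not necessarily the metric used in TV-Edit-Bench.

```python
from dataclasses import dataclass, field
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass
class CoInstruction:
    """Hypothetical container pairing a text instruction with sparse visual cues."""
    text: str                                                        # semantic intent
    drags: List[Tuple[Point, Point]] = field(default_factory=list)   # (start, end) pairs
    points: List[Point] = field(default_factory=list)                # click locations

    def validate(self, width: int, height: int) -> None:
        """Raise ValueError if the text is empty or a coordinate lies outside the image."""
        if not self.text.strip():
            raise ValueError("text instruction must be non-empty")
        coords = [p for pair in self.drags for p in pair] + self.points
        for x, y in coords:
            if not (0 <= x < width and 0 <= y < height):
                raise ValueError(f"point ({x}, {y}) is outside a {width}x{height} image")


def mean_target_distance(achieved: List[Point], targets: List[Point]) -> float:
    """Average Euclidean distance between achieved positions and drag targets.

    Assumed proxy for spatial alignment; lower is better.
    """
    if len(achieved) != len(targets) or not targets:
        raise ValueError("need equal-length, non-empty point lists")
    total = sum(hypot(ax - tx, ay - ty) for (ax, ay), (tx, ty) in zip(achieved, targets))
    return total / len(targets)


# Invented example: the text carries the intent, the drag carries the location.
example = CoInstruction(
    text="raise the dog's head",
    drags=[((10.0, 20.0), (10.0, 5.0))],
)
example.validate(width=64, height=64)
```

In this sketch the text field resolves what a drag means semantically, while the drag endpoints pin down where the change should occur, mirroring the division of roles described in the abstract.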