Text-Vision Co-Instructed Image Editing

Abstract

Existing image editing methods fall broadly into two categories: those driven by textual instructions and those driven by visual prompts. Textual instructions are semantically expressive but offer only coarse spatial control over the editing result. Visual prompts such as drags and points, in contrast, provide precise spatial guidance but are inherently ambiguous about semantic intent. To combine the strengths of both, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance, aiming at precise and intent-faithful image manipulation. To this end, we first construct a paired textual-visual instruction dataset of more than 23K samples derived from dynamic videos, which provides aligned supervision for cross-modal instructions. We then propose TV-Edit, a unified Textual-Visual instruction Editing framework that contextualizes drag- or point-based visual instructions with image-text semantics and lifts them into semantic-aware control representations for pretrained editing backbones. By integrating semantic intent with spatial constraints, TV-Edit achieves more precise spatial control, less instruction ambiguity, and stronger structural consistency than text-only or drag-only alternatives. Finally, we establish TV-Edit-Bench, a benchmark that uses ground-truth references and controlled textual-visual variations to reliably evaluate semantic faithfulness, spatial alignment, and visual consistency. Experiments across multiple editing backbones show that TV-Edit consistently yields more precise and intent-faithful edits, significantly outperforming state-of-the-art instruction-based and drag-based baselines.