Text-Vision Co-Instructed Image Editing

Abstract

Existing image editing methods fall broadly into two categories: those driven by textual instructions and those driven by visual prompts. Textual instructions are semantically expressive but offer only coarse-grained spatial control over the editing result. Visual prompts such as drags and points, in contrast, provide precise spatial guidance but leave the semantic intent inherently ambiguous. To combine the strengths of textual and visual prompts, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance, aiming for precise and intent-faithful image manipulation. To this end, we first construct a dataset of more than 23K paired textual-visual instruction samples derived from dynamic videos, enabling aligned supervision for cross-modal instructions. We then propose TV-Edit, a Textual-Visual instruction unified Editing framework that contextualizes drag- or point-based visual instructions with image-text semantics and lifts them into semantic-aware control representations for pretrained editing backbones. By integrating semantic intent and spatial constraints, TV-Edit achieves more precise spatial control, less instruction ambiguity, and stronger structural consistency than text-only or drag-based alternatives. Finally, we establish TV-Edit-Bench, a carefully designed benchmark that evaluates semantic faithfulness, spatial alignment, and visual consistency using ground-truth references and controlled textual-visual variations for reliable assessment. Experiments across multiple editing backbones show that TV-Edit consistently yields more precise and intent-faithful edits and significantly outperforms state-of-the-art instruction-based and drag-based baselines.
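To make the idea of a text-vision co-instruction concrete, the following is a minimal sketch of how one text instruction might be paired with sparse drag and point cues and rasterized into a sparse control map. It is an illustration only, not the TV-Edit implementation: the names `Drag`, `CoInstruction`, and `to_sparse_maps`, the displacement-map encoding, and the example instruction text are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates


@dataclass
class Drag:
    """A sparse drag cue: move the content at `start` toward `end`."""
    start: Point
    end: Point


@dataclass
class CoInstruction:
    """Hypothetical container pairing one text instruction with sparse visual cues."""
    text: str
    drags: List[Drag] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)


def to_sparse_maps(instr: CoInstruction, width: int, height: int):
    """Rasterize the visual cues into a displacement map and a binary mask.

    Returns (flow, mask). flow[y][x] holds the (dx, dy) displacement requested
    at a drag start pixel and (0, 0) elsewhere; mask[y][x] is 1 wherever a drag
    start or a point cue lies. Cues outside the image are ignored.
    """
    flow = [[(0, 0) for _ in range(width)] for _ in range(height)]
    mask = [[0 for _ in range(width)] for _ in range(height)]
    for drag in instr.drags:
        (x0, y0), (x1, y1) = drag.start, drag.end
        if 0 <= x0 < width and 0 <= y0 < height:
            flow[y0][x0] = (x1 - x0, y1 - y0)
            mask[y0][x0] = 1
    for x, y in instr.points:
        if 0 <= x < width and 0 <= y < height:
            mask[y][x] = 1
    return flow, mask


# Made-up example: the text carries the semantic intent, the drag says where.
example = CoInstruction(
    text="raise the dog's left paw",
    drags=[Drag(start=(2, 5), end=(2, 1))],
    points=[(4, 4)],
)
flow, mask = to_sparse_maps(example, width=8, height=8)
```

In this sketch the text string and the sparse maps stay separate; a framework such as the one described above would additionally have to fuse them with image-text semantics before conditioning an editing backbone, which this example does not attempt.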