WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing
March 12, 2026
Authors: Hui Zhang, Juntao Liu, Zongkai Liu, Liqiang Niu, Fandong Meng, Zuxuan Wu, Yu-Gang Jiang
cs.AI
Abstract
Instruction-based image editing aims to modify specific content within existing images according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric manipulation, text-centric image editing focuses on modifying, translating, or rearranging textual elements embedded within images. However, existing leading models often struggle to execute complex text editing precisely, frequently producing blurry or hallucinated characters. We attribute these failures primarily to the lack of specialized training paradigms tailored for text-centric editing, as well as the absence of large-scale datasets and standardized benchmarks necessary for a closed-loop training and evaluation system. To address these limitations, we present WeEdit, a systematic solution encompassing a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy. Specifically, we propose a novel HTML-based automatic editing pipeline, which generates 330K training pairs covering diverse editing operations and 15 languages, accompanied by standardized bilingual and multilingual benchmarks for comprehensive evaluation. On the algorithmic side, we employ glyph-guided supervised fine-tuning to inject explicit spatial and content priors, followed by a multi-objective reinforcement learning stage to align generation with instruction adherence, text clarity, and background preservation. Extensive experiments demonstrate that WeEdit outperforms previous open-source models by a clear margin across diverse editing operations.
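The abstract does not detail how the HTML-based pipeline produces paired data, but the core idea it describes can be sketched as follows. This is a minimal illustration of my own, not the authors' implementation: all class, function, and field names here are hypothetical. A text region is represented as an absolutely positioned HTML node; the edit is applied by swapping only that node's text, so the two pages share every other pixel by construction, and rasterizing both (e.g. with a headless browser) would yield one before/after training pair:

```python
from dataclasses import dataclass

@dataclass
class TextRegion:
    # Hypothetical description of one text element in the source image.
    text: str
    x: int
    y: int
    font_px: int
    color: str

def region_html(region: TextRegion) -> str:
    """Render a single text region as an absolutely positioned HTML node."""
    return (
        f'<div style="position:absolute;left:{region.x}px;top:{region.y}px;'
        f'font-size:{region.font_px}px;color:{region.color}">{region.text}</div>'
    )

def make_edit_pair(background_url: str, region: TextRegion, new_text: str):
    """Produce (source_html, target_html): identical layouts that differ
    only in the edited text, so non-target regions are preserved by
    construction. Rasterizing each page gives one training pair."""
    edited = TextRegion(new_text, region.x, region.y,
                        region.font_px, region.color)
    page = '<body style="margin:0;background:url({u})">{r}</body>'
    return (page.format(u=background_url, r=region_html(region)),
            page.format(u=background_url, r=region_html(edited)))

src_html, tgt_html = make_edit_pair(
    "bg.png", TextRegion("OPEN", 40, 120, 32, "#fff"), "CLOSED")
```

Because the two pages differ in exactly one text node, ground-truth edit masks and instruction templates (e.g. "replace 'OPEN' with 'CLOSED'") fall out of the generation process for free, which is what makes such a pipeline scalable to hundreds of thousands of pairs across many scripts.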