WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing

Abstract

Instruction-based image editing aims to modify specific content within existing images according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric manipulation, text-centric image editing focuses on modifying, translating, or rearranging textual elements embedded within images. However, existing leading models often struggle to execute complex text editing precisely, frequently producing blurry or hallucinated characters. We attribute these failures primarily to the lack of specialized training paradigms tailored for text-centric editing, as well as the absence of large-scale datasets and standardized benchmarks necessary for a closed-loop training and evaluation system. To address these limitations, we present WeEdit, a systematic solution encompassing a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy. Specifically, we propose a novel HTML-based automatic editing pipeline, which generates 330K training pairs covering diverse editing operations and 15 languages, accompanied by standardized bilingual and multilingual benchmarks for comprehensive evaluation. On the algorithmic side, we employ glyph-guided supervised fine-tuning to inject explicit spatial and content priors, followed by a multi-objective reinforcement learning stage to align generation with instruction adherence, text clarity, and background preservation. Extensive experiments demonstrate that WeEdit outperforms previous open-source models by a clear margin across diverse editing operations.
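The abstract does not spell out how the HTML-based pipeline builds its training pairs, so the following is only a minimal sketch of the general idea under stated assumptions: a source page and a target page share the same markup and styling and differ only in the text of one element, so rendering both would give an image pair in which the background is identical. The template, the `EditPair` type, the `make_edit_pair` helper, and the instruction wording are all hypothetical and are not taken from WeEdit. Rendering the HTML to images is left out.

```python
from dataclasses import dataclass
from html import escape

# Hypothetical template: one styled text element on a fixed background.
# The layout and styles are illustrative assumptions, not WeEdit's templates.
TEMPLATE = (
    '<div style="width:400px;height:200px;background:#f4e9d8">'
    '<p id="headline" style="font-size:32px;color:#222">{text}</p>'
    "</div>"
)


@dataclass
class EditPair:
    """One training example: an instruction plus source and target markup."""
    instruction: str
    source_html: str
    target_html: str


def make_edit_pair(source_text: str, target_text: str) -> EditPair:
    """Build two HTML documents that differ only in the edited text."""
    # Escape the text so characters such as '<' or '&' cannot break the markup.
    source_html = TEMPLATE.format(text=escape(source_text))
    target_html = TEMPLATE.format(text=escape(target_text))
    # Hypothetical instruction format for a simple text-replacement edit.
    instruction = f'Change the text "{source_text}" to "{target_text}".'
    return EditPair(instruction, source_html, target_html)


pair = make_edit_pair("Grand Opening", "Closing Sale")
```

Because both documents come from the same template, any difference between the rendered images would be confined to the edited text region, which is the property a text-centric editing pair needs.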
