

TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering

February 24, 2026
作者: Hanshen Zhu, Yuliang Liu, Xuecheng Wu, An-Lan Wang, Hao Feng, Dingkang Yang, Chao Feng, Can Huang, Jingqun Tang, Xiang Bai
cs.AI

Abstract

Visual Text Rendering (VTR) remains a critical challenge in text-to-image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and RL-based optimization. As a result, even state-of-the-art generators (e.g., SeedDream4.0, Qwen-Image) still struggle to render structurally faithful text. To address this, we propose TextPecker, a plug-and-play structural-anomaly-perceptive RL strategy that mitigates noisy reward signals and works with any text-to-image generator. To enable this capability, we construct a recognition dataset with character-level structural-anomaly annotations and develop a stroke-editing synthesis engine to expand structural-error coverage. Experiments show that TextPecker consistently improves diverse text-to-image models; even on the well-optimized Qwen-Image, it yields significant average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing a new state of the art in high-fidelity VTR. Our work fills a gap in VTR optimization, providing a foundational step towards reliable and structurally faithful visual text generation.
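To make the reward idea concrete, here is a minimal sketch of how a structural-anomaly-aware reward might combine semantic alignment with per-character structural fidelity while damping noisy recognizer scores. All names (`CharResult`, `structural_reward`, `anomaly_weight`, `noise_floor`) are hypothetical illustrations, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class CharResult:
    """Per-character output of a hypothetical anomaly-aware recognizer."""
    char: str             # recognized character
    anomaly_score: float  # 0.0 = clean stroke structure, 1.0 = severely distorted


def structural_reward(expected: str, results: list[CharResult],
                      anomaly_weight: float = 0.5,
                      noise_floor: float = 0.1) -> float:
    """Blend semantic match and structural fidelity into a single reward.

    Anomaly scores at or below `noise_floor` are treated as clean,
    a simple way to mitigate noisy reward signals from the recognizer.
    """
    if not results:
        return 0.0
    recognized = "".join(r.char for r in results)
    # Semantic alignment: fraction of expected characters rendered correctly.
    matches = sum(1 for e, r in zip(expected, recognized) if e == r)
    semantic = matches / max(len(expected), len(recognized))
    # Structural fidelity: average penalty from anomalies above the noise floor.
    penalties = [r.anomaly_score for r in results if r.anomaly_score > noise_floor]
    structural = 1.0 - sum(penalties) / len(results)
    return (1 - anomaly_weight) * semantic + anomaly_weight * structural
```

A perfectly rendered string with negligible anomaly scores would score 1.0, while a character that is recognized as the wrong glyph or carries a high anomaly score pulls both components down.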