GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering

Abstract

Generating accurate glyphs for visual text rendering is essential yet challenging. Existing methods typically improve text rendering by training on large numbers of high-quality scene-text images, but limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods use reinforcement learning to alleviate this issue, yet their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose GlyphPrinter, a preference-based text rendering method that eliminates reliance on explicit reward models. However, the standard DPO objective models only the overall preference between two samples, which is insufficient for visual text rendering, where glyph errors typically occur in localized regions. To address this issue, we construct the GlyphCorrector dataset with region-level glyph preference annotations and propose Region-Grouped DPO (R-GDPO), a region-based objective that optimizes inter- and intra-sample preferences over annotated regions, substantially enhancing glyph accuracy. Furthermore, we introduce Regional Reward Guidance, an inference strategy that samples from an optimal distribution with controllable glyph accuracy. Extensive experiments demonstrate that the proposed GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision.
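The difference between a whole-sample preference objective and a region-restricted one can be illustrated with a toy sketch. The abstract does not give the R-GDPO formulation, so the code below is not the paper's objective. It shows the standard DPO loss for one preference pair, plus a hypothetical variant that sums per-position log-probabilities only inside an annotated region mask. The function names, the per-position log-probability representation, and the example numbers are all assumptions made for illustration.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Inputs are scalar log-probabilities under the policy and the
    frozen reference model. Returns -log(sigmoid(beta * margin)).
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return math.log1p(math.exp(-beta * margin))


def masked_sum(values, mask):
    """Sum only the entries whose mask flag is truthy."""
    return sum(v for v, m in zip(values, mask) if m)


def regional_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, mask, beta=0.1):
    """Hypothetical region-restricted variant (an assumption, not R-GDPO).

    Each argument is a list of per-position log-probabilities; only
    positions inside the annotated region contribute to the margin.
    """
    return dpo_loss(
        masked_sum(logp_w, mask),
        masked_sum(logp_l, mask),
        masked_sum(ref_logp_w, mask),
        masked_sum(ref_logp_l, mask),
        beta,
    )


# Made-up example: the two samples differ in quality only at position 2
# (the "glyph region"), while their whole-sample totals are identical.
policy_w = [-1.0, -2.0, -0.5]   # preferred sample, per-position log-probs
policy_l = [-0.5, -1.0, -2.0]   # rejected sample, per-position log-probs
reference = [-1.0, -1.0, -1.0]  # reference model, same for both samples

whole_loss = regional_dpo_loss(policy_w, policy_l, reference, reference,
                               mask=[1, 1, 1])
region_loss = regional_dpo_loss(policy_w, policy_l, reference, reference,
                                mask=[0, 0, 1])
```

In this made-up example the whole-sample margin is zero, so the overall objective cannot tell the two samples apart, while the masked version still sees the preference inside the annotated region. This mirrors the motivation stated in the abstract for moving from sample-level to region-level preferences.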