GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering

Abstract

Generating accurate glyphs for visual text rendering is essential yet challenging. Existing methods typically enhance text rendering by training on a large number of high-quality scene text images, but the limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods leverage reinforcement learning to alleviate this issue, yet their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose GlyphPrinter, a preference-based text rendering method that eliminates reliance on explicit reward models. However, the standard DPO objective only models overall preference between two samples, which is insufficient for visual text rendering where glyph errors typically occur in localized regions. To address this issue, we construct the GlyphCorrector dataset with region-level glyph preference annotations and propose Region-Grouped DPO (R-GDPO), a region-based objective that optimizes inter- and intra-sample preferences over annotated regions, substantially enhancing glyph accuracy. Furthermore, we introduce Regional Reward Guidance, an inference strategy that samples from an optimal distribution with controllable glyph accuracy. Extensive experiments demonstrate that the proposed GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision.
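As background for the abstract's point that the standard DPO objective models only an overall preference between two samples, here is a minimal sketch of that sample-level loss for a single preference pair. It is the generic DPO formulation, not the paper's R-GDPO objective, whose exact form the abstract does not give. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for this illustration.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Sample-level DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: log-likelihood of the preferred / rejected sample
    under the model being trained (illustrative inputs).
    ref_logp_w / ref_logp_l: the same quantities under the frozen
    reference model.
    """
    # Implicit reward margin: how much more the trained model favors the
    # preferred sample over the rejected one, relative to the reference.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Each sample contributes one scalar log-likelihood, so the loss cannot
# tell which region of the image contains the glyph error.
print(dpo_loss(logp_w=-0.5, logp_l=-1.0, ref_logp_w=-1.0, ref_logp_l=-1.0))
```

Because the whole image collapses into a single log-likelihood term, a localized glyph error and a global quality difference are indistinguishable to this loss, which is the limitation the abstract says motivates region-level preferences.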
