Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods

Abstract

WordArt (artistic text) features highly customized fonts, textures, and layouts, making WordArt-oriented scene TExt Recognition (WATER) substantially more challenging than general Scene Text Recognition (STR). Existing STR datasets and methods are typically built around regular scene text and fixed-template inputs, and they struggle to scale to WATER. We therefore aim to advance this task from both the data and the model perspectives. On the data side, we construct WATER-S, a synthetic dataset of 2M samples that is hundreds of times larger than existing artistic text data. WATER-S consists of two complementary subsets. One is rendered by an upgraded rendering pipeline (SynthWordArt), which provides highly accurate and controllable synthetic WordArt data. The other is generated by combining Qwen3-VL for prompt mining with Z-Image for image synthesis, which improves the coverage of realistic and diverse data. On the model side, we propose WATERec. It adopts a visual encoder that supports arbitrary-shaped inputs and an autoregressive decoder that models complex layouts, structurally breaking the bottleneck of fixed-template STR on WordArt. Experiments show that this architecture outperforms prior STR methods and achieves state-of-the-art performance on irregular text such as WordArt. Together with WATER-R, which is carefully reorganized from existing real STR data, our strong baseline with the new synthetic data and model design reaches 90.40% accuracy on WordArt-Bench, surpassing both general-purpose and OCR-specialized vision-language models by a large margin. Code and data are available at https://github.com/YesianRohn/WATER.