Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods

Abstract

WordArt (artistic text) features highly customized fonts, textures, and layouts, making WordArt-oriented scene TExt Recognition (WATER) substantially more challenging than general Scene Text Recognition (STR). Existing STR datasets and methods are typically built around regular scene text and fixed-template inputs, and they struggle to extend to WATER. We therefore advance this task from both the data and the model perspective. On the data side, we construct WATER-S, a synthetic dataset of 2M samples that is hundreds of times larger than existing artistic text data. WATER-S consists of two complementary subsets. One is rendered by an upgraded rendering pipeline (SynthWordArt), which provides highly accurate and controllable synthetic WordArt data. The other is generated by combining Qwen3-VL for prompt mining with Z-Image for image synthesis, which improves the coverage of realistic and diverse data. On the model side, we propose WATERec. It adopts a visual encoder that supports arbitrary-shaped inputs and an autoregressive decoder to model complex layouts, structurally removing the bottleneck that fixed-template STR faces on WordArt. Experiments show that this architecture outperforms prior STR methods, achieving state-of-the-art performance on irregular text such as WordArt. Together with WATER-R, carefully reorganized from existing real STR data, our strong baseline with the new synthetic data and model design reaches 90.40% accuracy on WordArt-Bench, surpassing both general-purpose and OCR-specialized vision-language models by a large margin. Code and data are available at https://github.com/YesianRohn/WATER.