Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods

Abstract

WordArt (artistic text) features highly customized fonts, textures, and layouts, which makes WordArt-oriented scene TExt Recognition (WATER) substantially more challenging than general Scene Text Recognition (STR). Existing STR datasets and methods are typically built around regular scene text and fixed-template inputs, and they struggle to scale to WATER. We therefore aim to advance this task from both the data and the model perspective. On the data side, we construct WATER-S, a synthetic dataset of 2M samples that is hundreds of times larger than existing artistic text data. WATER-S consists of two complementary subsets. One is rendered by an upgraded rendering pipeline (SynthWordArt), which provides highly accurate and controllable synthetic WordArt data. The other is generated by combining Qwen3-VL for prompt mining with Z-Image for image synthesis, which improves the coverage of realistic and diverse data. On the model side, we propose WATERec. It adopts a visual encoder that supports arbitrary-shaped inputs and an autoregressive decoder that models complex layouts, structurally breaking the bottleneck of fixed-template STR on WordArt. Experiments show that this architecture outperforms prior STR methods, achieving state-of-the-art performance on irregular text such as WordArt. Together with WATER-R, which is carefully reorganized from existing real STR data, our strong baseline built on the new synthetic data and model design reaches 90.40% accuracy on WordArt-Bench, surpassing both general-purpose and OCR-specialized vision-language models by a large margin. Code and data are available at https://github.com/YesianRohn/WATER.
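For readers unfamiliar with how a recognition accuracy such as the 90.40% above is usually obtained, the following is a minimal sketch of word-level exact-match accuracy. The abstract does not state the evaluation protocol used for WordArt-Bench, so everything here is an assumption: the function name `word_accuracy`, the whitespace stripping, the case-insensitive default, and the toy predictions are all hypothetical and serve only as illustration.

```python
def word_accuracy(predictions, labels, case_sensitive=False):
    """Fraction of samples whose predicted string exactly matches its label.

    Assumption: one predicted string per image, compared after stripping
    surrounding whitespace and (by default) lowercasing. The benchmark's
    actual normalization rules may differ.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0

    def normalize(text):
        text = text.strip()
        return text if case_sensitive else text.lower()

    # Count a sample as correct only if the whole word matches.
    correct = sum(normalize(p) == normalize(t) for p, t in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical toy data: the third prediction confuses "O" with "0".
preds = ["SALE", "coffee", "0pen", "Bakery"]
truth = ["Sale", "Coffee", "Open", "Bakery"]
print(f"{100 * word_accuracy(preds, truth):.2f}%")  # prints 75.00%
```

Because a single wrong character makes the whole word count as an error, this kind of metric is strict on long or heavily stylized words, which is one reason artistic text tends to score lower than regular scene text.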