Typhoon OCR: Open Vision-Language Model For Thai Document Extraction
January 21, 2026
Authors: Surapon Nonesung, Natapong Nitarach, Teetouch Jaknamon, Pittawat Taveekitworachai, Kunat Pipatanakul
cs.AI
Abstract
Document extraction is a core component of digital workflows, yet existing vision-language models (VLMs) predominantly favor high-resource languages. Thai presents additional challenges: a complex non-Latin script, the absence of explicit word boundaries, and the prevalence of highly unstructured real-world documents, all of which limit the effectiveness of current open-source models. This paper presents Typhoon OCR, an open VLM for document extraction tailored to Thai and English. The model is fine-tuned from vision-language backbones on a Thai-focused training dataset, built with a multi-stage data construction pipeline that combines traditional OCR, VLM-based restructuring, and curated synthetic data. Typhoon OCR is a unified framework that performs text transcription and layout reconstruction while preserving document-level structural consistency. The latest iteration, Typhoon OCR V1.5, is a compact, inference-efficient model designed to reduce reliance on metadata and simplify deployment. Comprehensive evaluations across diverse Thai document categories, including financial reports, government forms, books, infographics, and handwritten documents, show that Typhoon OCR achieves performance comparable to or exceeding that of larger frontier proprietary models at substantially lower computational cost. The results demonstrate that open vision-language OCR models can achieve accurate text extraction and layout reconstruction for Thai documents, reaching performance comparable to proprietary systems while remaining lightweight and deployable.