HunyuanOCR Technical Report

November 24, 2025
Authors: Hunyuan Vision Team, Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, Qi Yang, Qiming Peng, Bin Luo, Hower Yang, Houwen Peng, Hongming Yang, Senhao Xie, Binghong Wu, Mana Yang, Sergey Wang, Raccoon Liu, Dick Zhu, Jie Jiang, Linus, Han Hu, Chengquan Zhang
cs.AI

Abstract

This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters.

HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow "OCR expert models" and inefficient "General VLMs". 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks.

HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.
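The abstract describes the model as a native ViT and a lightweight LLM joined by an MLP adapter. The sketch below is only an illustration of that connector pattern, not the released HunyuanOCR code: the dimensions, two-layer depth, and GELU activation are assumptions chosen for clarity.

```python
# Minimal sketch of a ViT-to-LLM MLP adapter, following the high-level
# description in the abstract. All dimensions (vit_dim, llm_dim), the
# two-layer depth, and the GELU activation are illustrative assumptions,
# not the actual HunyuanOCR configuration.
import torch
import torch.nn as nn


class MLPAdapter(nn.Module):
    """Projects ViT patch features into the LLM's embedding space."""

    def __init__(self, vit_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vit_features: torch.Tensor) -> torch.Tensor:
        # vit_features: (batch, num_patches, vit_dim) from the vision encoder
        return self.proj(vit_features)  # (batch, num_patches, llm_dim)


# Usage: the projected visual tokens would be concatenated with text token
# embeddings before being passed to the language model.
adapter = MLPAdapter()
visual_tokens = adapter(torch.randn(1, 256, 1024))
print(visual_tokens.shape)  # torch.Size([1, 256, 2048])
```

In this end-to-end setup, the adapter is the only bridge between vision and language; there is no separate layout-analysis or detection stage feeding the LLM, which is what the abstract credits for avoiding pipeline error propagation.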