
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

March 14, 2025
Authors: Ahmed Nassar, Andres Marafioti, Matteo Omenetti, Maksym Lysak, Nikolaos Livathinos, Christoph Auer, Lucas Morin, Rafael Teixeira de Lima, Yusik Kim, A. Said Gurbuz, Michele Dolfi, Miquel Farré, Peter W. J. Staar
cs.AI

Abstract

We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or on ensemble solutions built from handcrafted pipelines of multiple specialized models, SmolDocling offers end-to-end conversion that accurately captures the content, structure, and spatial location of document elements in a 256M-parameter vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, and lists across a diverse range of document types, including business documents, academic papers, technical reports, patents, and forms -- extending significantly beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for chart, table, equation, and code recognition. Experimental results demonstrate that SmolDocling competes with vision-language models up to 27 times larger while substantially reducing computational requirements. The model is currently available; the datasets will be made public soon.
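
Since the abstract notes that the model is already available, the sketch below shows one plausible way to run it for single-page conversion. It assumes the checkpoint is published on Hugging Face under an identifier like "ds4sd/SmolDocling-256M-preview" and loads through the standard transformers vision-to-sequence interface; the model id, prompt string, and file name are illustrative assumptions, not details taken from this page.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ds4sd/SmolDocling-256M-preview"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# One page image goes in; no upstream layout-analysis or OCR pipeline is needed.
page = Image.open("page.png")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Convert this page to docling."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[page], return_tensors="pt")

# The model emits DocTags: markup that interleaves element tags (text, table,
# code, formula, ...) with location tokens giving each element's page position.
generated = model.generate(**inputs, max_new_tokens=4096)
doctags = processor.batch_decode(generated[:, inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=False)[0]
print(doctags)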
