SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
March 14, 2025
作者: Ahmed Nassar, Andres Marafioti, Matteo Omenetti, Maksym Lysak, Nikolaos Livathinos, Christoph Auer, Lucas Morin, Rafael Teixeira de Lima, Yusik Kim, A. Said Gurbuz, Michele Dolfi, Miquel Farré, Peter W. J. Staar
cs.AI
Abstract
We introduce SmolDocling, an ultra-compact vision-language model targeting
end-to-end document conversion. Our model comprehensively processes entire
pages by generating DocTags, a new universal markup format that captures all
page elements in their full context with location. Unlike existing approaches
that rely on large foundational models, or on ensemble solutions built from
handcrafted pipelines of multiple specialized models, SmolDocling offers
end-to-end conversion that accurately captures the content, structure, and
spatial location of document elements in a 256M-parameter vision-language model.
SmolDocling exhibits robust performance in correctly reproducing document
features such as code listings, tables, equations, charts, lists, and more
across a diverse range of document types including business documents, academic
papers, technical reports, patents, and forms -- significantly extending beyond
the commonly observed focus on scientific papers. Additionally, we contribute
novel publicly sourced datasets for charts, tables, equations, and code
recognition. Experimental results demonstrate that SmolDocling competes with
other Vision Language Models that are up to 27 times larger in size, while
reducing computational requirements substantially. The model is currently
available; the datasets will be made publicly available soon.
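The abstract describes DocTags as a markup format that pairs each page element with its type and location. The sketch below illustrates what consuming such output might look like; the tag names (`section_header`, `text`) and `<loc_N>` position tokens are assumptions for illustration based on the format as described here, not the official DocTags vocabulary.

```python
import re

# Hypothetical DocTags-like output: element tags wrapping content, with
# optional <loc_N> tokens encoding a bounding box. Tag names and token
# layout are illustrative assumptions, not the official specification.
DOCTAGS_SAMPLE = (
    "<section_header><loc_42><loc_18><loc_470><loc_40>"
    "1. Introduction</section_header>"
    "<text><loc_42><loc_50><loc_470><loc_120>"
    "We introduce SmolDocling ...</text>"
)

ELEMENT_RE = re.compile(
    r"<(?P<tag>\w+)>"                                      # element type
    r"(?:<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>)?"   # optional bbox
    r"(?P<body>.*?)"                                       # element content
    r"</(?P=tag)>",                                        # matching close tag
    re.S,
)

def parse_doctags(s):
    """Yield (tag, bbox, text) triples from a flat DocTags-like string."""
    for m in ELEMENT_RE.finditer(s):
        bbox = tuple(int(g) for g in m.groups()[1:5]) if m.group(2) else None
        yield m.group("tag"), bbox, m.group("body")

for tag, bbox, body in parse_doctags(DOCTAGS_SAMPLE):
    print(tag, bbox, body)
```

Because every element carries both its type and its coordinates, a single flat token stream like this can be deserialized back into a structured, spatially grounded document representation, which is the property the abstract attributes to DocTags.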