TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
December 1, 2025
Authors: Junyuan Zhang, Bin Wang, Qintong Zhang, Fan Wu, Zichen Wen, Jialin Lu, Junjie Shan, Ziqi Zhao, Shuya Yang, Ziling Wang, Ziyang Miao, Huaping Zhong, Yuhang Zang, Xiaoyi Dong, Ka-Ho Chow, Conghui He
cs.AI
Abstract
Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown. As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) on labeled data. While VLMs have brought TR to the next level, pushing performance further demands large-scale labeled data that is costly to obtain. Consequently, although proprietary models have continuously pushed the performance boundary, open-source models still lag far behind; these are often trained with limited resources and are, in practice, the only viable option for many users because of privacy regulations. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies the unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to answer these questions correctly from the recognition results provides feedback to optimize the TR model. This closed-loop process allows the TR model to autonomously learn to recognize, structure, and reason over tables without labeled data. Leveraging this pipeline, we present TRivia-3B, an open-source, compact, and state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks. Model and code are released at: https://github.com/opendatalab/TRivia
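
To make the question-answering-based reward and the group-relative update more concrete, the following is a minimal sketch and not the released TRivia code. It assumes that each sampled recognition result for a table image is scored by the fraction of questions answered correctly from it, and that those scores are then normalized within the group of samples, as is standard in Group Relative Policy Optimization. The function names `qa_reward` and `group_relative_advantages`, the exact-match answer comparison, and the `eps` constant are all illustrative assumptions; the abstract does not specify how answers are compared.

```python
from statistics import mean, pstdev


def qa_reward(predicted_answers, reference_answers):
    """Hypothetical reward: fraction of questions answered correctly.

    `predicted_answers` are answers derived from one recognition result;
    `reference_answers` are the expected answers for the same questions.
    Exact match after trimming and lowercasing is an assumed comparison rule.
    """
    if not reference_answers:
        return 0.0
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predicted_answers, reference_answers)
    )
    return correct / len(reference_answers)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled outputs.

    Each advantage is (reward - group mean) / (group std + eps), so outputs
    are ranked only relative to other samples for the same table image.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical recognition results for one table, two questions each.
    references = ["12", "Paris"]
    group_predictions = [["12", "paris"], ["12", "London"], ["7", "Rome"]]
    rewards = [qa_reward(p, references) for p in group_predictions]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Under this formulation, a group in which every sample earns the same reward yields advantages of zero and therefore no gradient signal, which is one plausible reason a method of this kind would prefer samples whose outputs disagree; whether TRivia selects samples this way is not stated in the abstract.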