TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
December 1, 2025
Authors: Junyuan Zhang, Bin Wang, Qintong Zhang, Fan Wu, Zichen Wen, Jialin Lu, Junjie Shan, Ziqi Zhao, Shuya Yang, Ziling Wang, Ziyang Miao, Huaping Zhong, Yuhang Zang, Xiaoyi Dong, Ka-Ho Chow, Conghui He
cs.AI
Abstract
Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown. As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) on labeled data. While VLMs have substantially advanced TR, pushing performance further demands large-scale labeled data that is costly to obtain. Consequently, although proprietary models keep pushing the performance boundary, open-source models still lag far behind; they are often trained with limited resources, yet in practice they are the only viable option for many users because of privacy regulations. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies the unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to interpret the recognition result and answer those questions correctly provides the feedback used to optimize the TR model. This closed-loop process allows the TR model to autonomously learn to recognize, structure, and reason over tables without labeled data. Leveraging this pipeline, we present TRivia-3B, an open-source, compact, and state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks. Model and code are released at: https://github.com/opendatalab/TRivia
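The reward loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository for that); it only shows the general shape of a question-answering-based reward combined with the group-relative advantage normalization commonly used in Group Relative Policy Optimization. Every name here (`qa_reward`, `group_relative_advantages`, `toy_answer_fn`) is hypothetical. The sketch assumes that each generated question comes with a reference answer, and that a separate answerer reads only the recognized table text; a trivial lookup function stands in for that answerer.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence

QAPair = tuple[str, str]  # (question, reference answer); assumed format


def qa_reward(
    recognized_table: str,
    qa_pairs: Sequence[QAPair],
    answer_fn: Callable[[str, str], str],
) -> float:
    """Fraction of questions answered correctly using only the recognized table."""
    if not qa_pairs:
        return 0.0
    correct = 0
    for question, reference in qa_pairs:
        predicted = answer_fn(recognized_table, question)
        if predicted.strip().lower() == reference.strip().lower():
            correct += 1
    return correct / len(qa_pairs)


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> list[float]:
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_answer_fn(table: str, question: str) -> str:
    """Stand-in answerer (assumption): treats 'key | value' rows as a lookup table."""
    rows: dict[str, str] = {}
    for line in table.splitlines():
        if "|" in line:
            key, value = line.split("|", 1)
            rows[key.strip()] = value.strip()
    return rows.get(question, "")


# Three candidate recognitions sampled for the same (unlabeled) table image.
candidates = [
    "Alice | 90\nBob | 85",  # both cells correct
    "Alice | 90\nBob | 35",  # one cell misread
    "Alice | 10\nBob | 35",  # both cells misread
]
qa_pairs = [("Alice", "90"), ("Bob", "85")]

rewards = [qa_reward(c, qa_pairs, toy_answer_fn) for c in candidates]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

In this toy setup, candidates that preserve more of the table content earn higher rewards, and the normalization turns those rewards into signed advantages relative to the other candidates for the same image, so no ground-truth HTML or Markdown is needed.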