TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

Abstract

Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown. As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) on labeled data. While VLMs have brought TR to the next level, pushing performance further demands large-scale labeled data that is costly to obtain. Consequently, although proprietary models have continuously pushed the performance boundary, open-source models still lag far behind; they are often trained with limited resources and, because of privacy regulations, are in practice the only viable option for many users. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies the unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to interpret the recognition results and answer those questions correctly provides the feedback used to optimize the TR model. This closed-loop process allows the TR model to autonomously learn to recognize, structure, and reason over tables without labeled data. Leveraging this pipeline, we present TRivia-3B, an open-source, compact, and state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks. The model and code are released at: https://github.com/opendatalab/TRivia
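The reward loop described above can be illustrated with a minimal sketch. It is not the released TRivia implementation. It assumes the reward for one recognition candidate is the fraction of generated questions answered correctly from that candidate, and that advantages are obtained by standardizing rewards within a sampling group, as is common in Group Relative Policy Optimization. The function names `qa_reward` and `group_relative_advantages`, the exact-match answer comparison, and the use of the population standard deviation are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def qa_reward(predicted_answers, reference_answers):
    """Hypothetical reward: fraction of questions answered correctly.

    Answers are compared after trimming whitespace and lowercasing;
    the actual matching rule used by TRivia may differ.
    """
    if not reference_answers:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predicted_answers, reference_answers)
    )
    return hits / len(reference_answers)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the other candidates in its group.

    Assumption: population standard deviation plus a small epsilon
    to avoid division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two candidate recognitions of the same table image, scored on two questions.
references = ["3", "paris"]
rewards = [
    qa_reward(["3", "Paris"], references),  # both answers match
    qa_reward(["4", "Rome"], references),   # neither answer matches
]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, a group whose candidates all receive the same reward yields all-zero advantages and therefore contributes no gradient signal. This is a general property of group-standardized advantages, not a statement about how TRivia selects its samples.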
