SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction

Abstract

Table extraction from document images is a challenging AI problem, and labelled data for many content domains is difficult to come by. Existing table extraction datasets often focus on scientific tables because of the vast number of academic articles that are readily available along with their source code. However, tables from scientific, financial, and other domains differ significantly in layout and typography. Current datasets often lack the words contained within the tables and their positions, relying instead on unreliable OCR to extract these features for training modern machine learning models on natural language processing tasks. There is therefore a need for a more general method of obtaining labelled data. We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our hope is that our method of generating these synthetic tables is transferable to other domains. To demonstrate the effectiveness of our dataset in training models to extract information from table images, we create FinTabQA, a layout large language model trained on an extractive question-answering task. We test our model on real-world financial tables, compare it to a state-of-the-art generative model, and discuss the results. We make the dataset, model, and dataset generation code publicly available.