The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts
August 31, 2024
Authors: I. de Rodrigo, A. Sanchez-Cuadrado, J. Boal, A. J. Lopez-Lopez
cs.AI
Abstract
This paper introduces the MERIT Dataset, a multimodal (text + image + layout)
fully labeled dataset within the context of school reports. Comprising over 400
labels and 33k samples, the MERIT Dataset is a valuable resource for training
models in demanding Visually-rich Document Understanding (VrDU) tasks. By its
nature (student grade reports), the MERIT Dataset can potentially include
biases in a controlled way, making it a valuable tool to benchmark biases
induced in Large Language Models (LLMs). The paper outlines the dataset's generation
pipeline and highlights its main features in the textual, visual, layout, and
bias domains. To demonstrate the dataset's utility, we present a benchmark with
token classification models, showing that the dataset poses a significant
challenge even for SOTA models and that these would greatly benefit from
including samples from the MERIT Dataset in their pretraining phase.