The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts
August 31, 2024
Authors: I. de Rodrigo, A. Sanchez-Cuadrado, J. Boal, A. J. Lopez-Lopez
cs.AI
Abstract
This paper introduces the MERIT Dataset, a multimodal (text + image + layout)
fully labeled dataset within the context of school reports. Comprising over 400
labels and 33k samples, the MERIT Dataset is a valuable resource for training
models in demanding Visually-rich Document Understanding (VrDU) tasks. By its
nature (student grade reports), the MERIT Dataset can potentially include
biases in a controlled way, making it a valuable tool to benchmark biases
induced in Large Language Models (LLMs). The paper outlines the dataset's generation
pipeline and highlights its main features in the textual, visual, layout, and
bias domains. To demonstrate the dataset's utility, we present a benchmark with
token classification models, showing that the dataset poses a significant
challenge even for SOTA models and that these would greatly benefit from
including samples from the MERIT Dataset in their pretraining phase.