The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts

Abstract

This paper introduces the MERIT Dataset, a fully labeled multimodal (text + image + layout) dataset of school reports. Comprising over 400 labels and 33k samples, the MERIT Dataset is a valuable resource for training models on demanding Visually-rich Document Understanding (VrDU) tasks. Because it consists of student grade reports, the MERIT Dataset can include biases in a controlled way, which makes it a useful tool for benchmarking the biases induced in Large Language Models (LLMs). The paper outlines the dataset's generation pipeline and highlights its main features in the textual, visual, layout, and bias domains. To demonstrate the dataset's utility, we present a benchmark with token classification models, showing that the dataset poses a significant challenge even for state-of-the-art models and that these models would greatly benefit from including MERIT Dataset samples in their pretraining phase.