Noise-Aware Training of Layout-Aware Language Models
March 30, 2024
Authors: Ritesh Sarkhel, Xiaoqi Ren, Lauro Beltrao Costa, Guolong Su, Vincent Perot, Yanan Xie, Emmanouil Koukoumidis, Arnab Nandi
cs.AI
Abstract
A visually rich document (VRD) utilizes visual features along with linguistic
cues to disseminate information. Training a custom extractor that identifies
named entities from a document requires a large number of instances of the
target document type annotated in both textual and visual modalities. This is an
expensive bottleneck in enterprise scenarios, where we want to train custom
extractors for thousands of different document types in a scalable way.
Pre-training an extractor model on unlabeled instances of the target document
type, followed by a fine-tuning step on human-labeled instances, does not work
in these scenarios, as it exceeds the maximum allowable training time
allocated for the extractor. In this paper, we address this scenario by
proposing a Noise-Aware Training (NAT) method. Instead of acquiring
expensive human-labeled documents, NAT utilizes weakly labeled documents to
train an extractor in a scalable way. To avoid degradation in the model's
quality due to noisy, weakly labeled samples, NAT estimates the confidence of
each training sample and incorporates it as an uncertainty measure during
training. We train multiple state-of-the-art extractor models using NAT.
Experiments on a number of publicly available and in-house datasets show that
NAT-trained models are not only more robust in performance -- they outperform
a transfer-learning baseline by up to 6% in macro-F1 score -- but also more
label-efficient, reducing the human effort required to obtain comparable
performance by up to 73%.
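The core idea of incorporating per-sample confidence as an uncertainty measure can be sketched as a confidence-weighted loss. The function name and the simple linear weighting below are illustrative assumptions for exposition, not the paper's exact formulation:

```python
def confidence_weighted_loss(per_sample_losses, confidences):
    """Aggregate per-sample training losses, down-weighting low-confidence
    (likely noisy) weakly labeled samples.

    Illustrative sketch only; NAT's actual uncertainty formulation
    may differ from this linear weighting.
    """
    assert len(per_sample_losses) == len(confidences)
    total_weight = sum(confidences)
    if total_weight == 0:
        return 0.0
    # Each sample contributes in proportion to its estimated confidence,
    # so samples with noisy weak labels influence the update less.
    weighted_sum = sum(c * l for c, l in zip(confidences, per_sample_losses))
    return weighted_sum / total_weight
```

In a deep-learning framework, the same effect is obtained by computing an unreduced per-sample loss, multiplying by the confidence scores, and normalizing before backpropagation.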