OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

Abstract

Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect predictions of OCR and the inherently non-uniform representation of structured data, knowledge bases inevitably contain various kinds of OCR noise. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 350 carefully selected unstructured PDF documents from six real-world RAG application domains, along with Q&As derived from multimodal elements in the documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise, Semantic Noise and Formatting Noise, and apply perturbation to generate a set of structured data with varying degrees of each noise type. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the vulnerability of RAG systems. Furthermore, we discuss the potential of employing Vision-Language Models (VLMs) without OCR in RAG systems. Code: https://github.com/opendatalab/OHR-Bench
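The perturbation idea mentioned in the abstract, producing copies of clean text with increasing amounts of each noise type, can be illustrated with a small sketch. This is not the OHRBench implementation: the confusion table, the function names `add_semantic_noise` and `add_formatting_noise`, and the choice of stray Markdown bold markers as a stand-in for formatting noise are all assumptions made for illustration only.

```python
import random

# Hypothetical confusion table (an assumption, not taken from the paper):
# character pairs that an OCR engine might plausibly mix up.
CONFUSABLE = {"l": "1", "1": "l", "O": "0", "0": "O", "S": "5", "5": "S"}


def add_semantic_noise(text: str, rate: float, seed: int = 0) -> str:
    """Swap each confusable character for its look-alike with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in CONFUSABLE and rng.random() < rate:
            out.append(CONFUSABLE[ch])
        else:
            out.append(ch)
    return "".join(out)


def add_formatting_noise(text: str, rate: float, seed: int = 0) -> str:
    """Wrap each space-separated word in Markdown bold markers with probability `rate`.

    Stray style markup is used here as a simple stand-in for formatting noise.
    """
    rng = random.Random(seed)
    words = text.split(" ")
    return " ".join(f"**{w}**" if w and rng.random() < rate else w for w in words)


if __name__ == "__main__":
    clean = "Sales in 2015 rose to 100 million dollars."
    # Generate variants of the same text at increasing noise levels.
    for rate in (0.0, 0.5, 1.0):
        print(rate, "|", add_semantic_noise(clean, rate), "|", add_formatting_noise(clean, rate))
```

Feeding such variants into the same retriever and generator, and comparing answer quality across noise levels, is one way to observe how errors in the knowledge base propagate downstream.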
