KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

February 20, 2025
Authors: Ahmed Heakl, Abdullah Sohail, Mukul Ranjan, Rania Hossam, Ghazi Ahmed, Mohamed El-Geish, Omar Maher, Zhiqiang Shen, Fahad Khan, Salman Khan
cs.AI

Abstract

With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 sub-domains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision-language models (such as GPT-4, Gemini, and Qwen) outperform traditional OCR approaches (like EasyOCR, PaddleOCR, and Surya) by an average of 60% in Character Error Rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges in accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.
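
The abstract reports results in Character Error Rate (CER) without defining it. As a rough illustration only, the sketch below uses the common definition: the Levenshtein edit distance between the reference and the OCR output, divided by the reference length. The function names `levenshtein` and `cer` are hypothetical, and the exact normalization and text preprocessing used by KITAB-Bench are not stated in the abstract, so this may differ from the authors' evaluation code.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalized by reference length.

    Assumption: characters are Unicode code points, so Arabic diacritics
    count as separate characters. A benchmark may normalize text differently.
    """
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # The OCR output drops one of the four letters in the reference word.
    print(cer("كتاب", "كتب"))
```

Lower CER is better, and under this definition the value can exceed 1.0 when the OCR output contains many spurious characters.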
