ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios

January 13, 2026
Authors: António Loison, Quentin Macé, Antoine Edy, Victor Xing, Tom Balough, Gabriel Moreira, Bo Liu, Manuel Faysse, Céline Hudelot, Gautier Viaud
cs.AI

Abstract

Retrieval-Augmented Generation (RAG) pipelines must address challenges beyond simple single-document retrieval, such as interpreting visual elements (tables, charts, images), synthesizing information across documents, and providing accurate source grounding. Existing benchmarks fail to capture this complexity, often focusing on textual data, single-document comprehension, or evaluating retrieval and generation in isolation. We introduce ViDoRe v3, a comprehensive multimodal RAG benchmark featuring multi-type queries over visually rich document corpora. It covers 10 datasets across diverse professional domains, comprising ~26,000 document pages paired with 3,099 human-verified queries, each available in 6 languages. Through 12,000 hours of human annotation effort, we provide high-quality annotations for retrieval relevance, bounding box localization, and verified reference answers. Our evaluation of state-of-the-art RAG pipelines reveals that visual retrievers outperform textual ones, late-interaction models and textual reranking substantially improve performance, and hybrid or purely visual contexts enhance answer generation quality. However, current models still struggle with non-textual elements, open-ended queries, and fine-grained visual grounding. To encourage progress in addressing these challenges, the benchmark is released under a commercially permissive license at https://hf.co/vidore.