CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage. That is a serious risk in high-stakes domains such as law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer and evaluates the two jointly. CiteVQA comprises 1,897 questions over 711 PDFs spanning seven domains and two languages, with documents averaging 40.6 pages. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline, which identifies crucial evidence via masking ablation, and are then validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. Auditing 20 MLLMs reveals pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5. CiteVQA thus exposes a reliability gap that answer-only evaluations overlook and provides the instrumentation needed to close it, a step towards trustworthy document intelligence. Our repository is available at https://github.com/opendatalab/CiteVQA.
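To make the joint criterion behind SAA concrete, the following is a minimal sketch and not the benchmark's actual scoring code. The abstract only states that a prediction is credited when both the answer and the cited region are correct. Everything else here is an assumption made for illustration: the `Prediction` record, the use of IoU with a 0.5 threshold to decide whether a cited box matches a ground-truth box, and the rule that every ground-truth region must be cited and every citation must match some ground-truth region.

```python
from dataclasses import dataclass

# Hypothetical box format: (x0, y0, x1, y1) in page coordinates.
Box = tuple[float, float, float, float]


@dataclass
class Prediction:
    # All fields are illustrative assumptions, not the benchmark's schema.
    answer_correct: bool          # verdict of some answer judge
    cited: list[tuple[int, Box]]  # (page, box) pairs cited by the model
    gold: list[tuple[int, Box]]   # (page, box) ground-truth evidence


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def _covered(targets, candidates, thr: float) -> bool:
    """True if every target has a same-page candidate with IoU >= thr."""
    return all(
        any(pt == pc and iou(bt, bc) >= thr for pc, bc in candidates)
        for pt, bt in targets
    )


def citation_correct(p: Prediction, thr: float = 0.5) -> bool:
    # Assumed rule: citations are non-empty, cover all gold regions,
    # and contain no box that fails to match a gold region.
    return (
        bool(p.cited)
        and _covered(p.gold, p.cited, thr)
        and _covered(p.cited, p.gold, thr)
    )


def strict_attributed_accuracy(preds: list[Prediction], thr: float = 0.5) -> float:
    """Percentage of predictions whose answer AND citation are both correct."""
    if not preds:
        return 0.0
    hits = sum(p.answer_correct and citation_correct(p, thr) for p in preds)
    return 100.0 * hits / len(preds)
```

Under this sketch, a prediction with a correct answer but a citation on the wrong page counts as a miss, which is the Attribution Hallucination case that answer-only accuracy would still score as a success.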